# Troubleshooting Flow: pg_server___high_cpu_v2
This runbook is based on the post [Troubleshooting High CPU Utilization in PostgreSQL Databases: A How-To Guide](https://jfrog.com/community/data-science/troubleshooting-high-cpu-utilization-in-postgresql-databases-a-how-to-guide/) by Dmitry Romanoff.

It is provided as-is. Future versions of the runbook will provide more in-depth analysis.

## SQL Query #1 – Connections summary
**Query Desc:** A high number of active connections is a common cause of high CPU utilization in PostgreSQL. This query summarizes total and non-idle connections against max_connections.  
**Analysis:** Examine the sessions running on the PostgreSQL DB instance and use EXPLAIN to identify and analyze long-running, badly written, or too-frequent queries. If there is more than one active connection per CPU core, check and tune the application(s) working with the DB.

In [1]:
import sqlalchemy
import pandas as pd
import configparser
import matplotlib.pyplot as plt 

# Read the connection string from the config file
config = configparser.ConfigParser()
with open(r'../ipynb.cfg') as cfg_file:
    config.read_file(cfg_file)

con_str = config.get('con_str', 'PG_AIRBASES')
engine = sqlalchemy.create_engine(con_str)

# Open one connection that the cells below reuse
try:
    connection = engine.connect()
    print("Opened Connection")
except sqlalchemy.exc.SQLAlchemyError as error:
    print("Error while connecting to PostgreSQL database:", error)


Opened Connection
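
The connection string is read from the con_str section of ../ipynb.cfg, which this runbook does not show. Below is a minimal sketch of what that file might contain, assuming a standard SQLAlchemy PostgreSQL URL with the psycopg2 driver. The user, password, host, and port are placeholders, not values from the original setup.

```python
import configparser

# Hypothetical contents of ../ipynb.cfg; every value in the URL is a placeholder.
SAMPLE_CFG = """
[con_str]
PG_AIRBASES = postgresql+psycopg2://db_user:db_password@localhost:5432/airbases
"""

# interpolation=None keeps '%' characters in a password from being treated
# as configparser interpolation markers.
config = configparser.ConfigParser(interpolation=None)
config.read_string(SAMPLE_CFG)

# Option names are case-insensitive in configparser, section names are not.
con_str = config.get("con_str", "PG_AIRBASES")
print(con_str)
```

Keeping credentials in a file outside the notebook avoids committing them together with the runbook output.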


In [2]:
qry_connections = """
select 
    A.total_connections, 
    A.non_idle_connections, 
    B.max_connections,
    round((100 * A.total_connections::numeric / B.max_connections::numeric), 2) connections_utilization_pctg
from
  (select count(1) as total_connections, sum(case when state!='idle' then 1 else 0 end) as non_idle_connections from pg_stat_activity) A,
  (select setting as max_connections from pg_settings where name='max_connections') B;
 """
df = pd.read_sql_query(qry_connections, connection)
df

Unnamed: 0,total_connections,non_idle_connections,max_connections,connections_utilization_pctg
0,55,1,835,6.59
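
The rule of thumb from the analysis (no more than one active connection per CPU core) can be sketched as a small check. The function name is hypothetical, non-idle connections are used here as a stand-in for active ones, and cpu_cores=4 is an assumed figure, since the query does not report the core count of the DB server.

```python
def summarize_connections(total, non_idle, max_connections, cpu_cores):
    """Return connection utilization and whether non-idle connections exceed one per core."""
    return {
        "utilization_pct": round(100 * total / max_connections, 2),
        "non_idle_per_core": non_idle / cpu_cores,
        "needs_tuning": non_idle > cpu_cores,
    }


# Values taken from the result above; cpu_cores=4 is an assumption for illustration.
summary = summarize_connections(total=55, non_idle=1, max_connections=835, cpu_cores=4)
print(summary)
```

With these numbers the instance is far below both limits, so connection pressure is unlikely to be the cause of a CPU spike at the time of this snapshot.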


## SQL Query #2 – Distribution of non-idle connections per database
**Query Desc:** Shows the number of non-idle connections per database, sorted in descending order.  
**Analysis:** Examine the sessions running on the database with the most non-idle connections, looking for long-running, badly written, or too-frequent queries.

In [3]:
qry_non_idle_connection = """
select 
datname as db_name, 
count(1) as num_non_idle_connections 
from pg_stat_activity 
where state!='idle' 
group by 1 
order by 2 desc;
"""
df = pd.read_sql_query(qry_non_idle_connection, connection)
df

Unnamed: 0,db_name,num_non_idle_connections
0,airbases,1


## SQL Query #3 – Distribution of non-idle connections per database and per query
**Query Desc:** Shows the number of non-idle connections per database and per query, sorted in descending order.  
**Analysis:** Examine the SQL queries with the most non-idle connections. A high number of non-idle connections may indicate an inefficient or non-scalable architecture, or a workload that does not match the system resources.  
TODO: show the full length of the SQL in the dataframe 

In [4]:
qry_non_idle_connections_by_query = """
select 
datname as db_name, 
substr(query, 1, 200) short_query, 
count(1) as num_non_idle_connections 
from pg_stat_activity 
where state!='idle' 
group by 1, 2 
order by 3 desc;
"""
df = pd.read_sql_query(qry_non_idle_connections_by_query, connection)
df

Unnamed: 0,db_name,short_query,num_non_idle_connections
0,airbases,"\nselect \n A.total_connections, \n A.no...",1
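
One possible way to address the TODO in this section is to lift the pandas per-cell width limit, which is what shortens the query text to "..." in the output. A minimal sketch, using a made-up dataframe in place of the real result:

```python
import pandas as pd

# Stand-in for the query result; the SQL text is made up for illustration.
long_sql = "select " + ", ".join(f"col_{i}" for i in range(30)) + " from some_table"
df = pd.DataFrame({"db_name": ["airbases"], "short_query": [long_sql]})

# Lift the per-cell width limit only inside this block.
with pd.option_context("display.max_colwidth", None):
    width_inside = pd.get_option("display.max_colwidth")
    print(df)

# Outside the block the previous limit applies again.
width_outside = pd.get_option("display.max_colwidth")
```

This only changes how pandas displays the text. The query itself still cuts each statement at 200 characters through substr(query, 1, 200), so that limit would need raising as well to see longer statements.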


## SQL Query #4 – Non-idle sessions detailed
**Query Desc:** Lists non-idle PostgreSQL sessions whose current query has been running for more than five seconds, sorted by runtime in descending order.  
**Analysis:** Long-running queries can cause high CPU utilization. Analyze and tune the queries returned in the result set.

If a query runs too long and puts a heavy load on the DB CPU and other resources, you may want to terminate it explicitly. To terminate a PostgreSQL DB session by its process ID, run the following command: ```select pg_terminate_backend(<process_id>);```

In [5]:
qry_non_idle_sessions_details = """ 
select 
	now()-query_start as runtime, 
	pid as process_id, 
	datname as db_name, 
	client_addr,
	client_hostname,
	substr(query, 1, 200) the_query
from pg_stat_activity
where state!='idle'
and now() - query_start > '5 seconds'::interval
order by 1 desc; """ 

df = pd.read_sql_query(qry_non_idle_sessions_details, connection)
df

Unnamed: 0,runtime,process_id,db_name,client_addr,client_hostname,the_query
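
Before terminating a session, it may be enough to cancel only its current query: pg_cancel_backend stops the running statement and leaves the session connected, while pg_terminate_backend ends the whole session. The sketch below builds cancel statements for sessions that exceed a runtime limit. The helper name, the five-minute threshold, and the sample rows are assumptions for illustration.

```python
from datetime import timedelta


def build_cancel_statements(sessions, max_runtime=timedelta(minutes=5)):
    """Return one pg_cancel_backend call per session running longer than max_runtime."""
    statements = []
    for session in sessions:
        if session["runtime"] > max_runtime:
            # int() ensures only a numeric pid ever ends up in the SQL text
            statements.append(f"select pg_cancel_backend({int(session['process_id'])});")
    return statements


# Made-up rows in the shape of the query result above.
sessions = [
    {"process_id": 4711, "runtime": timedelta(minutes=12)},
    {"process_id": 4712, "runtime": timedelta(seconds=8)},
]
statements = build_cancel_statements(sessions)
print(statements)
```

Generating the statements first and reviewing them before execution keeps a human in the loop, which seems prudent for anything that interrupts application queries.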


## SQL Query #5 – Running frequent SQL queries
The root cause of high CPU utilization in PostgreSQL databases is not necessarily a long-running query. Quick but too-frequent queries that run hundreds of times per second can cause high CPU utilization too. This query requires the pg_stat_statements extension.

In [6]:
qry_frequent_sql_calls = """ 
with
a as (select dbid, queryid, query, calls s from pg_stat_statements),
b as (select dbid, queryid, query, calls s from pg_stat_statements, pg_sleep(1))
select
        pd.datname as db_name, 
        substr(a.query, 1, 400) as the_query, 
        sum(b.s-a.s) as runs_per_second
from a, b, pg_database pd
where 
  a.dbid= b.dbid 
and 
  a.queryid = b.queryid 
and 
  pd.oid=a.dbid
group by 1, 2
order by 3 desc; """ 

df = pd.read_sql_query(qry_frequent_sql_calls, connection)
df

Unnamed: 0,db_name,the_query,runs_per_second
0,platform-v2,$1,409566.0
1,airbases-demo,EXPLAIN (FORMAT JSON) SELECT public.load_postg...,47819.0
2,airbases,EXPLAIN (FORMAT JSON) SELECT public.load_postg...,37391.0
3,airbases,EXPLAIN (FORMAT JSON) \n -- Your SQL qu...,27249.0
4,airbases-demo,EXPLAIN (FORMAT JSON) \n SELECT \n\tdat...,25994.0
...,...,...,...
6673,airbases-demo,"SELECT \n\tdatid as dbid, \n datname as db_na...",-25994.0
6674,airbases,--\nINSERT INTO metis.pg_stat_database_snapsho...,-27249.0
6675,airbases,SELECT public.load_postgres_log_files(),-37391.0
6676,airbases-demo,SELECT public.load_postgres_log_files(),-47819.0
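
The result above contains negative and implausibly large rates that mirror each other. One possible explanation, offered as an assumption and not a confirmed diagnosis, is that pg_stat_statements can hold several rows with the same dbid and queryid (for example one per userid), and a join on dbid and queryid alone pairs those rows with each other. A sketch of an alternative is to take two snapshots of ```select userid, dbid, queryid, calls from pg_stat_statements``` about a second apart and diff them in Python. The function name and the sample data are hypothetical.

```python
def calls_per_second(before, after, interval_seconds):
    """Diff two snapshots of {(userid, dbid, queryid): calls} into per-second rates."""
    rates = {}
    for key, calls_after in after.items():
        calls_before = before.get(key)
        if calls_before is None:
            continue  # entry appeared between the snapshots, no baseline to diff against
        delta = calls_after - calls_before
        if delta < 0:
            continue  # counter went backwards, e.g. after a statistics reset
        rates[key] = delta / interval_seconds
    # Highest rate first, like the ORDER BY in the SQL version
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))


# Made-up snapshots; note two users sharing queryid 111.
before = {(10, 1, 111): 1000, (20, 1, 111): 50, (10, 1, 222): 7}
after = {(10, 1, 111): 1300, (20, 1, 111): 52, (10, 1, 222): 7}
rates = calls_per_second(before, after, interval_seconds=1)
print(rates)
```

Because each key is compared only with itself, the two users that share queryid 111 get separate, non-negative rates.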


## SQL Query #6 – PostgreSQL Database CPU distribution per database, and per query
**Query Desc:** Shows how much execution and planning time each query in each database consumes, as a proxy for CPU usage. The result set is sorted in descending order, most expensive queries first.  
**Analysis:** Check the SQL queries that consume the most total time. Also look for queries with a high mean time and/or a high number of calls. To read statistics for queries run by other users, the monitoring user needs the pg_read_all_stats role: ```GRANT pg_read_all_stats TO <db_user>;```

In [2]:
## The query below only works on PG 13 or higher
## For PG 12 or older use this query: https://jfrog.com/community/data-science/troubleshooting-high-cpu-utilization-in-postgresql-databases-a-how-to-guide/

qry_cpu_per_db = """ 
SELECT 
        pss.userid,
        pss.dbid,
        pd.datname as db_name,
        round((pss.total_exec_time + pss.total_plan_time)::numeric, 2) as total_time, 
        pss.calls, 
        round((pss.mean_exec_time+pss.mean_plan_time)::numeric, 2) as mean, 
        round((100 * (pss.total_exec_time + pss.total_plan_time) / sum((pss.total_exec_time + pss.total_plan_time)::numeric) OVER ())::numeric, 2) as cpu_portion_pctg,
        substr(pss.query, 1, 200) short_query
FROM pg_stat_statements pss, pg_database pd 
WHERE pd.oid=pss.dbid
ORDER BY (pss.total_exec_time + pss.total_plan_time)
DESC LIMIT 30;
""" 

df = pd.read_sql_query(qry_cpu_per_db, connection)
df


Unnamed: 0,userid,dbid,db_name,total_time,calls,mean,cpu_portion_pctg,short_query
0,16395,2242855,airbases-demo,3298913000.0,138368,23841.59,42.98,"SELECT departure_airport, booking_id, is_retur..."
1,16395,2242855,airbases-demo,923825000.0,141236,6541.0,12.04,SELECT *\nFROM postgres_air.booking as b\n\tJO...
2,16395,2242855,airbases-demo,863961400.0,138360,6244.3,11.26,select count(*) from postgres_air.boarding_pas...
3,16395,2242855,airbases-demo,601026600.0,141235,4255.51,7.83,SELECT *\nFROM postgres_air.booking as b\n\tJO...
4,16395,71456,airbases,540081000.0,417188,1294.57,7.04,"SELECT relid, schemaname as shchema_name, sut...."
5,16395,2242855,airbases-demo,216289800.0,417337,518.26,2.82,"SELECT relid, schemaname as shchema_name, sut...."
6,16395,2242855,airbases-demo,190891400.0,141240,1351.54,2.49,"--Explain (analyze, timing)\nSELECT *\nFROM po..."
7,16395,14301,postgres,109706500.0,3352749,32.72,1.43,"SELECT blk_read_time, blk_write_time, calls, d..."
8,16395,27603,books,94592860.0,605250,156.29,1.23,"SELECT queryid, query, calls, round(total_exec..."
9,498421,14301,postgres,93692290.0,1632059,57.41,1.22,"SELECT calls, datname, local_blks_dirtied, loc..."
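
When reading cpu_portion_pctg, note that the window function sum(...) OVER () is evaluated before LIMIT, so each percentage is relative to all rows in pg_stat_statements and the displayed rows need not add up to 100. A small sketch of the same calculation in Python, with a hypothetical function name and made-up numbers:

```python
def add_time_share(rows, top_n):
    """Compute each row's share of total time over ALL rows, then keep the top_n rows."""
    grand_total = sum(row["total_time"] for row in rows)
    ranked = sorted(rows, key=lambda row: row["total_time"], reverse=True)
    return [
        {**row, "share_pct": round(100 * row["total_time"] / grand_total, 2)}
        for row in ranked[:top_n]
    ]


# Made-up totals in milliseconds, the unit pg_stat_statements uses for its timings.
rows = [
    {"query": "q3", "total_time": 100.0},
    {"query": "q1", "total_time": 600.0},
    {"query": "q2", "total_time": 300.0},
]
top = add_time_share(rows, top_n=2)
print(top)
```

Here the two displayed rows account for 90 percent of the total, and the remaining 10 percent belongs to the row cut off by the limit.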


## SQL Query #7 – Check PostgreSQL DB tables statistics
Outdated PostgreSQL statistics can be another root cause of high CPU utilization. When the statistics are stale, the PostgreSQL query planner may generate inefficient execution plans, which degrades the performance of the entire PostgreSQL DB server.  
**Query Desc:** Shows the date and time the statistics were last updated for each table in the public schema of the connected database.

In [8]:
qry_table_statistics = """ 
select
  schemaname,
  relname,
  DATE_TRUNC('minute', last_analyze) last_analyze,
  DATE_TRUNC('minute', last_autoanalyze) last_autoanalyze
from
  pg_stat_all_tables
where
  schemaname = 'public'
order by
  last_analyze desc NULLS FIRST,
  last_autoanalyze desc NULLS FIRST; """ 

df = pd.read_sql_query(qry_table_statistics, connection)
df

Unnamed: 0,schemaname,relname,last_analyze,last_autoanalyze
0,public,index_stats,,NaT
1,public,accounts,,NaT
2,public,qa_tests_flights,,NaT
3,public,orders_y2023m03,,NaT
4,public,qa_table_wings,,NaT
5,public,t1,,NaT
6,public,orders_test,,NaT
7,public,vacuum_logs,,NaT
8,public,orders,,NaT
9,public,sales,,NaT
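
In the result above, the listed tables have no last_analyze or last_autoanalyze timestamp, which suggests they have not been analyzed since the statistics counters were last reset. A possible follow-up is to generate ANALYZE statements for tables whose statistics are missing or old. The sketch below assumes missing timestamps arrive as None (pandas NaT values would need converting first). The seven-day threshold, the function name, and the sample rows are illustrative.

```python
from datetime import datetime, timedelta


def stale_tables(rows, now, max_age=timedelta(days=7)):
    """Return ANALYZE statements for tables never analyzed or analyzed too long ago."""
    statements = []
    for row in rows:
        timestamps = [t for t in (row["last_analyze"], row["last_autoanalyze"]) if t is not None]
        last = max(timestamps) if timestamps else None
        if last is None or now - last > max_age:
            # Simple quoting; names containing double quotes are not handled in this sketch
            statements.append(f'analyze "{row["schemaname"]}"."{row["relname"]}";')
    return statements


# Made-up rows in the shape of the query result above.
rows = [
    {"schemaname": "public", "relname": "orders", "last_analyze": None, "last_autoanalyze": None},
    {"schemaname": "public", "relname": "sales", "last_analyze": datetime(2024, 1, 9), "last_autoanalyze": datetime(2024, 1, 2)},
    {"schemaname": "public", "relname": "accounts", "last_analyze": None, "last_autoanalyze": datetime(2023, 12, 1)},
]
statements = stale_tables(rows, now=datetime(2024, 1, 10))
print(statements)
```

The more recent of the two timestamps is used, since either a manual ANALYZE or an autoanalyze run refreshes the planner statistics.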
