
Increase database work_mem to 20MB #1744

Merged
merged 1 commit into develop from feature/jcw/increase-db-work-mem on Mar 20, 2022

Conversation

jwalgran (Contributor)

Overview

When troubleshooting production outages caused by 100% CPU usage, we use pgbadger to analyze and inspect the database logs. The temp file usage report showed several queries that produced temp files over 4MB in size, the default working memory (work_mem) setting for Postgres. The largest temp file we saw in our logs was just over 16MB.

Screen Shot 2022-03-20 at 11 59 34 AM

Screen Shot 2022-03-20 at 11 59 53 AM
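
Reports like the ones above can be produced by running pgbadger against the
downloaded Postgres logs. (The log_temp_files = 500 entry in the parameter
group below is what causes temp files larger than 500 KB to be logged in the
first place.) A minimal sketch; the instance identifier and log file name here
are assumptions for illustration:

# Download a recent Postgres log portion from RDS.
aws rds download-db-log-file-portion \
  --db-instance-identifier openapparelregistry-staging \
  --log-file-name error/postgresql.log.2022-03-20-11 \
  --output text > postgresql.log

# Generate an HTML report; the "Temporary Files" section lists the queries
# that spilled to disk along with the temp file sizes.
pgbadger postgresql.log -o report.html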

In this PR we increase the working memory to 20MB in an attempt to have all queries run in memory without spilling onto disk, which is expensive.
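
To illustrate the cost, a large sort can be flipped between an on-disk
external merge and an in-memory quicksort by varying work_mem per session.
This is a hypothetical demonstration, not a query from our logs; $DATABASE_URL
is a placeholder connection string:

psql "$DATABASE_URL" <<'SQL'
SET work_mem = '4MB';
-- Should report something like: Sort Method: external merge  Disk: ...
EXPLAIN ANALYZE SELECT g FROM generate_series(1, 500000) g ORDER BY g DESC;

SET work_mem = '20MB';
-- Should report something like: Sort Method: quicksort  Memory: ...
EXPLAIN ANALYZE SELECT g FROM generate_series(1, 500000) g ORDER BY g DESC;
SQL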

The RDS parameter group value for work_mem is specified in KB, so we use a value of 20000.
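
The applied value can be double-checked directly on the parameter group. A
sketch using the AWS CLI; the group name is taken from the apply output below:

# The work_mem value is reported in KB, so expect 20000 here.
aws rds describe-db-parameters \
  --db-parameter-group-name openapparelregistry-stg20201008160659946100000001 \
  --query "Parameters[?ParameterName=='work_mem'].[ParameterName,ParameterValue]" \
  --output table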

Our database free memory graph always remains stable at over 2.3GB, and we have fewer than ten simultaneous connections to the database at any given time, so we do not expect this increase in working memory to cause a free memory issue. Even if every connection ran several sorts or hashes at the full 20MB each (work_mem applies per operation, not per connection), total usage would stay in the low hundreds of MB, well under that headroom.

Screen Shot 2022-03-20 at 12 08 49 PM

Connects #1727

Demo

Before

Screen Shot 2022-03-20 at 11 54 57 AM

Terraform Plan

I looked up the latest deploy CI job to get the GIT_COMMIT value.

From inside the terraform container:

bash-5.1#  GIT_COMMIT=ade00e2 ./scripts/infra plan
------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  ~ aws_db_parameter_group.default
      parameter.#:                       "8" => "9"
      parameter.1160499149.apply_method: "immediate" => "immediate"
      parameter.1160499149.name:         "log_connections" => "log_connections"
      parameter.1160499149.value:        "0" => "0"
      parameter.1764331501.apply_method: "immediate" => "immediate"
      parameter.1764331501.name:         "log_min_duration_statement" => "log_min_duration_statement"
      parameter.1764331501.value:        "500" => "500"
      parameter.2217426290.apply_method: "immediate" => "immediate"
      parameter.2217426290.name:         "seq_page_cost" => "seq_page_cost"
      parameter.2217426290.value:        "1" => "1"
      parameter.2221178149.apply_method: "immediate" => "immediate"
      parameter.2221178149.name:         "log_disconnections" => "log_disconnections"
      parameter.2221178149.value:        "0" => "0"
      parameter.2311719471.apply_method: "" => "immediate"
      parameter.2311719471.name:         "" => "work_mem"
      parameter.2311719471.value:        "" => "20000"
      parameter.2358470327.apply_method: "immediate" => "immediate"
      parameter.2358470327.name:         "log_autovacuum_min_duration" => "log_autovacuum_min_duration"
      parameter.2358470327.value:        "250" => "250"
      parameter.3022839578.apply_method: "immediate" => "immediate"
      parameter.3022839578.name:         "random_page_cost" => "random_page_cost"
      parameter.3022839578.value:        "1" => "1"
      parameter.3509339723.apply_method: "immediate" => "immediate"
      parameter.3509339723.name:         "log_lock_waits" => "log_lock_waits"
      parameter.3509339723.value:        "1" => "1"
      parameter.3903021451.apply_method: "immediate" => "immediate"
      parameter.3903021451.name:         "log_temp_files" => "log_temp_files"
      parameter.3903021451.value:        "500" => "500"

  ~ aws_lambda_function.alert_batch_failures
      last_modified:                     "2022-03-20T16:10:44.000+0000" => <computed>
      source_code_hash:                  "VgLkfYzd5j8IWSgrCroQbqfce1NO/rTEg4yn1cr4WsI=" => "2DqJ/Rwy6nYEVkLTtEb/rQPlrT6Q6dYrSslegGIp8KY="

  ~ aws_lambda_function.alert_sfn_failures
      last_modified:                     "2022-03-20T16:10:51.000+0000" => <computed>
      source_code_hash:                  "wtFUH3DuaC6xlmaGmD09czVUAiZpys/0Aw/JyUvQBNM=" => "NU40jVAyedVmycqlkpQTYhEawWa+0DHTrnf3hcpXIyU="


Plan: 0 to add, 3 to change, 0 to destroy.

------------------------------------------------------------------------

Note that the only value changes in the aws_db_parameter_group are the parameter count (8 => 9) and the new work_mem entry.

Terraform Apply

From inside the terraform container:

bash-5.1#  GIT_COMMIT=ade00e2 ./scripts/infra apply
+ [[ -n ade00e2 ]]
+ GIT_COMMIT=ade00e2
+ '[' ./scripts/infra = ./scripts/infra ']'
+ '[' apply = --help ']'
++ dirname ./scripts/infra
+ TERRAFORM_DIR=./scripts/../deployment/terraform
+ echo

+ echo 'Attempting to deploy application version [ade00e2]...'
Attempting to deploy application version [ade00e2]...
+ echo -----------------------------------------------------
-----------------------------------------------------
+ echo

+ [[ -n openapparelregistry-staging-config-eu-west-1 ]]
+ pushd ./scripts/../deployment/terraform
/usr/local/src/deployment/terraform /usr/local/src
+ aws s3 cp s3://openapparelregistry-staging-config-eu-west-1/terraform/terraform.tfvars openapparelregistry-staging-config-eu-west-1.tfvars
download: s3://openapparelregistry-staging-config-eu-west-1/terraform/terraform.tfvars to ./openapparelregistry-staging-config-eu-west-1.tfvars
+ case "${1}" in
+ terraform apply openapparelregistry-staging-config-eu-west-1.tfplan
aws_db_parameter_group.default: Modifying... (ID: openapparelregistry-stg20201008160659946100000001)
  parameter.#:                       "8" => "9"
  parameter.1160499149.apply_method: "immediate" => "immediate"
  parameter.1160499149.name:         "log_connections" => "log_connections"
  parameter.1160499149.value:        "0" => "0"
  parameter.1764331501.apply_method: "immediate" => "immediate"
  parameter.1764331501.name:         "log_min_duration_statement" => "log_min_duration_statement"
  parameter.1764331501.value:        "500" => "500"
  parameter.2217426290.apply_method: "immediate" => "immediate"
  parameter.2217426290.name:         "seq_page_cost" => "seq_page_cost"
  parameter.2217426290.value:        "1" => "1"
  parameter.2221178149.apply_method: "immediate" => "immediate"
  parameter.2221178149.name:         "log_disconnections" => "log_disconnections"
  parameter.2221178149.value:        "0" => "0"
  parameter.2311719471.apply_method: "" => "immediate"
  parameter.2311719471.name:         "" => "work_mem"
  parameter.2311719471.value:        "" => "20000"
  parameter.2358470327.apply_method: "immediate" => "immediate"
  parameter.2358470327.name:         "log_autovacuum_min_duration" => "log_autovacuum_min_duration"
  parameter.2358470327.value:        "250" => "250"
  parameter.3022839578.apply_method: "immediate" => "immediate"
  parameter.3022839578.name:         "random_page_cost" => "random_page_cost"
  parameter.3022839578.value:        "1" => "1"
  parameter.3509339723.apply_method: "immediate" => "immediate"
  parameter.3509339723.name:         "log_lock_waits" => "log_lock_waits"
  parameter.3509339723.value:        "1" => "1"
  parameter.3903021451.apply_method: "immediate" => "immediate"
  parameter.3903021451.name:         "log_temp_files" => "log_temp_files"
  parameter.3903021451.value:        "500" => "500"
aws_lambda_function.alert_batch_failures: Modifying... (ID: funcStagingAlertBatchFailures)
  last_modified:    "2022-03-20T16:10:44.000+0000" => "<computed>"
  source_code_hash: "VgLkfYzd5j8IWSgrCroQbqfce1NO/rTEg4yn1cr4WsI=" => "2DqJ/Rwy6nYEVkLTtEb/rQPlrT6Q6dYrSslegGIp8KY="
aws_lambda_function.alert_sfn_failures: Modifying... (ID: funcStagingAlertStepFunctionsFailures)
  last_modified:    "2022-03-20T16:10:51.000+0000" => "<computed>"
  source_code_hash: "wtFUH3DuaC6xlmaGmD09czVUAiZpys/0Aw/JyUvQBNM=" => "NU40jVAyedVmycqlkpQTYhEawWa+0DHTrnf3hcpXIyU="
aws_db_parameter_group.default: Modifications complete after 7s (ID: openapparelregistry-stg20201008160659946100000001)
aws_lambda_function.alert_batch_failures: Still modifying... (ID: funcStagingAlertBatchFailures, 10s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 10s elapsed)
aws_lambda_function.alert_batch_failures: Still modifying... (ID: funcStagingAlertBatchFailures, 20s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 20s elapsed)
aws_lambda_function.alert_batch_failures: Still modifying... (ID: funcStagingAlertBatchFailures, 30s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 30s elapsed)
aws_lambda_function.alert_batch_failures: Still modifying... (ID: funcStagingAlertBatchFailures, 40s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 40s elapsed)
aws_lambda_function.alert_batch_failures: Still modifying... (ID: funcStagingAlertBatchFailures, 50s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 50s elapsed)
aws_lambda_function.alert_batch_failures: Modifications complete after 57s (ID: funcStagingAlertBatchFailures)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 1m0s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 1m10s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 1m20s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 1m30s elapsed)
aws_lambda_function.alert_sfn_failures: Modifications complete after 1m38s (ID: funcStagingAlertStepFunctionsFailures)

Apply complete! Resources: 0 added, 3 changed, 0 destroyed.
+ [[ -n '' ]]
+ popd
/usr/local/src
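
Because the parameter uses apply_method "immediate", new connections pick up
the setting without a reboot. It can be confirmed from any fresh session (a
sketch; the connection variables are placeholders):

# $DB_HOST, $DB_USER, and $DB_NAME stand in for the staging database details.
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c 'SHOW work_mem;'
# Should print 20000kB, confirming the parameter group change took effect.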

After

Screen Shot 2022-03-20 at 12 25 26 PM

Testing Instructions

  • Review the Demo above

Checklist

  • fixup! commits have been squashed
  • CI passes after rebase
  • CHANGELOG.md updated with summary of features or fixes, following Keep a Changelog guidelines

@TaiWilkin (Contributor) left a comment:

I've read through the notes and everything appears correct to me. We have already been running this in staging and it has been working as expected.

@TaiWilkin assigned jwalgran and unassigned TaiWilkin on Mar 20, 2022
@jwalgran force-pushed the feature/jcw/increase-db-work-mem branch from 97c1efe to 8fad3dc on March 20, 2022 21:08
@jwalgran merged commit fcd51ae into develop on Mar 20, 2022
@jwalgran deleted the feature/jcw/increase-db-work-mem branch on March 20, 2022 21:12