
Increase database work_mem to 20MB #1744

Merged
merged 1 commit into develop from feature/jcw/increase-db-work-mem on Mar 20, 2022

Conversation

jwalgran (Contributor)

Overview

When troubleshooting production outages caused by 100% CPU usage, we use pgbadger to analyze and inspect the database logs. The temp file usage report showed several queries that produced temp files over 4MB in size, the default working memory (work_mem) setting for Postgres. The largest temp file we saw in our logs was just over 16MB.

Screen Shot 2022-03-20 at 11 59 34 AM

Screen Shot 2022-03-20 at 11 59 53 AM
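
Reports like the ones above can be produced by running pgbadger against the
downloaded Postgres logs. (The log_temp_files = 500 entry in the parameter
group below is what causes temp files larger than 500 KB to be logged in the
first place.) A minimal sketch; the instance identifier and log file name here
are assumptions for illustration:

# Download a recent Postgres log portion from RDS.
aws rds download-db-log-file-portion \
  --db-instance-identifier openapparelregistry-staging \
  --log-file-name error/postgresql.log.2022-03-20-11 \
  --output text > postgresql.log

# Generate an HTML report; the "Temporary Files" section lists the queries
# that spilled to disk along with the temp file sizes.
pgbadger postgresql.log -o report.html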

In this PR we increase the working memory to 20MB in an attempt to have all queries run in memory without spilling onto disk, which is expensive.
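
To illustrate the cost, a large sort can be flipped between an on-disk
external merge and an in-memory quicksort by varying work_mem per session.
This is a hypothetical demonstration, not a query from our logs; $DATABASE_URL
is a placeholder connection string:

psql "$DATABASE_URL" <<'SQL'
SET work_mem = '4MB';
-- Should report something like: Sort Method: external merge  Disk: ...
EXPLAIN ANALYZE SELECT g FROM generate_series(1, 500000) g ORDER BY g DESC;

SET work_mem = '20MB';
-- Should report something like: Sort Method: quicksort  Memory: ...
EXPLAIN ANALYZE SELECT g FROM generate_series(1, 500000) g ORDER BY g DESC;
SQL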

The RDS parameter group value for work_mem is specified in KB, so we use a value of 20000.
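
The applied value can be double-checked directly on the parameter group. A
sketch using the AWS CLI; the group name is taken from the apply output below:

# The work_mem value is reported in KB, so expect 20000 here.
aws rds describe-db-parameters \
  --db-parameter-group-name openapparelregistry-stg20201008160659946100000001 \
  --query "Parameters[?ParameterName=='work_mem'].[ParameterName,ParameterValue]" \
  --output table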

Our database free memory graph always remains stable at over 2.3GB, and we have fewer than ten simultaneous connections to the database at any given time, so we do not expect this increase in working memory to cause a free memory issue. Even if every connection ran several sorts or hashes at the full 20MB each (work_mem applies per operation, not per connection), total usage would stay in the low hundreds of MB, well under that headroom.

Screen Shot 2022-03-20 at 12 08 49 PM

Connects #1727

Demo

Before

Screen Shot 2022-03-20 at 11 54 57 AM

Terraform Plan

I looked up the latest deploy CI job to get the GIT_COMMIT value.

From inside the terraform container:

bash-5.1#  GIT_COMMIT=ade00e2 ./scripts/infra plan
------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  ~ aws_db_parameter_group.default
      parameter.#:                       "8" => "9"
      parameter.1160499149.apply_method: "immediate" => "immediate"
      parameter.1160499149.name:         "log_connections" => "log_connections"
      parameter.1160499149.value:        "0" => "0"
      parameter.1764331501.apply_method: "immediate" => "immediate"
      parameter.1764331501.name:         "log_min_duration_statement" => "log_min_duration_statement"
      parameter.1764331501.value:        "500" => "500"
      parameter.2217426290.apply_method: "immediate" => "immediate"
      parameter.2217426290.name:         "seq_page_cost" => "seq_page_cost"
      parameter.2217426290.value:        "1" => "1"
      parameter.2221178149.apply_method: "immediate" => "immediate"
      parameter.2221178149.name:         "log_disconnections" => "log_disconnections"
      parameter.2221178149.value:        "0" => "0"
      parameter.2311719471.apply_method: "" => "immediate"
      parameter.2311719471.name:         "" => "work_mem"
      parameter.2311719471.value:        "" => "20000"
      parameter.2358470327.apply_method: "immediate" => "immediate"
      parameter.2358470327.name:         "log_autovacuum_min_duration" => "log_autovacuum_min_duration"
      parameter.2358470327.value:        "250" => "250"
      parameter.3022839578.apply_method: "immediate" => "immediate"
      parameter.3022839578.name:         "random_page_cost" => "random_page_cost"
      parameter.3022839578.value:        "1" => "1"
      parameter.3509339723.apply_method: "immediate" => "immediate"
      parameter.3509339723.name:         "log_lock_waits" => "log_lock_waits"
      parameter.3509339723.value:        "1" => "1"
      parameter.3903021451.apply_method: "immediate" => "immediate"
      parameter.3903021451.name:         "log_temp_files" => "log_temp_files"
      parameter.3903021451.value:        "500" => "500"

  ~ aws_lambda_function.alert_batch_failures
      last_modified:                     "2022-03-20T16:10:44.000+0000" => <computed>
      source_code_hash:                  "VgLkfYzd5j8IWSgrCroQbqfce1NO/rTEg4yn1cr4WsI=" => "2DqJ/Rwy6nYEVkLTtEb/rQPlrT6Q6dYrSslegGIp8KY="

  ~ aws_lambda_function.alert_sfn_failures
      last_modified:                     "2022-03-20T16:10:51.000+0000" => <computed>
      source_code_hash:                  "wtFUH3DuaC6xlmaGmD09czVUAiZpys/0Aw/JyUvQBNM=" => "NU40jVAyedVmycqlkpQTYhEawWa+0DHTrnf3hcpXIyU="


Plan: 0 to add, 3 to change, 0 to destroy.

------------------------------------------------------------------------

Note that the only value changes in the aws_db_parameter_group are the parameter count (8 => 9) and the new work_mem entry.

Terraform Apply

From inside the terraform container:

bash-5.1#  GIT_COMMIT=ade00e2 ./scripts/infra apply
+ [[ -n ade00e2 ]]
+ GIT_COMMIT=ade00e2
+ '[' ./scripts/infra = ./scripts/infra ']'
+ '[' apply = --help ']'
++ dirname ./scripts/infra
+ TERRAFORM_DIR=./scripts/../deployment/terraform
+ echo

+ echo 'Attempting to deploy application version [ade00e2]...'
Attempting to deploy application version [ade00e2]...
+ echo -----------------------------------------------------
-----------------------------------------------------
+ echo

+ [[ -n openapparelregistry-staging-config-eu-west-1 ]]
+ pushd ./scripts/../deployment/terraform
/usr/local/src/deployment/terraform /usr/local/src
+ aws s3 cp s3://openapparelregistry-staging-config-eu-west-1/terraform/terraform.tfvars openapparelregistry-staging-config-eu-west-1.tfvars
download: s3://openapparelregistry-staging-config-eu-west-1/terraform/terraform.tfvars to ./openapparelregistry-staging-config-eu-west-1.tfvars
+ case "${1}" in
+ terraform apply openapparelregistry-staging-config-eu-west-1.tfplan
aws_db_parameter_group.default: Modifying... (ID: openapparelregistry-stg20201008160659946100000001)
  parameter.#:                       "8" => "9"
  parameter.1160499149.apply_method: "immediate" => "immediate"
  parameter.1160499149.name:         "log_connections" => "log_connections"
  parameter.1160499149.value:        "0" => "0"
  parameter.1764331501.apply_method: "immediate" => "immediate"
  parameter.1764331501.name:         "log_min_duration_statement" => "log_min_duration_statement"
  parameter.1764331501.value:        "500" => "500"
  parameter.2217426290.apply_method: "immediate" => "immediate"
  parameter.2217426290.name:         "seq_page_cost" => "seq_page_cost"
  parameter.2217426290.value:        "1" => "1"
  parameter.2221178149.apply_method: "immediate" => "immediate"
  parameter.2221178149.name:         "log_disconnections" => "log_disconnections"
  parameter.2221178149.value:        "0" => "0"
  parameter.2311719471.apply_method: "" => "immediate"
  parameter.2311719471.name:         "" => "work_mem"
  parameter.2311719471.value:        "" => "20000"
  parameter.2358470327.apply_method: "immediate" => "immediate"
  parameter.2358470327.name:         "log_autovacuum_min_duration" => "log_autovacuum_min_duration"
  parameter.2358470327.value:        "250" => "250"
  parameter.3022839578.apply_method: "immediate" => "immediate"
  parameter.3022839578.name:         "random_page_cost" => "random_page_cost"
  parameter.3022839578.value:        "1" => "1"
  parameter.3509339723.apply_method: "immediate" => "immediate"
  parameter.3509339723.name:         "log_lock_waits" => "log_lock_waits"
  parameter.3509339723.value:        "1" => "1"
  parameter.3903021451.apply_method: "immediate" => "immediate"
  parameter.3903021451.name:         "log_temp_files" => "log_temp_files"
  parameter.3903021451.value:        "500" => "500"
aws_lambda_function.alert_batch_failures: Modifying... (ID: funcStagingAlertBatchFailures)
  last_modified:    "2022-03-20T16:10:44.000+0000" => "<computed>"
  source_code_hash: "VgLkfYzd5j8IWSgrCroQbqfce1NO/rTEg4yn1cr4WsI=" => "2DqJ/Rwy6nYEVkLTtEb/rQPlrT6Q6dYrSslegGIp8KY="
aws_lambda_function.alert_sfn_failures: Modifying... (ID: funcStagingAlertStepFunctionsFailures)
  last_modified:    "2022-03-20T16:10:51.000+0000" => "<computed>"
  source_code_hash: "wtFUH3DuaC6xlmaGmD09czVUAiZpys/0Aw/JyUvQBNM=" => "NU40jVAyedVmycqlkpQTYhEawWa+0DHTrnf3hcpXIyU="
aws_db_parameter_group.default: Modifications complete after 7s (ID: openapparelregistry-stg20201008160659946100000001)
aws_lambda_function.alert_batch_failures: Still modifying... (ID: funcStagingAlertBatchFailures, 10s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 10s elapsed)
aws_lambda_function.alert_batch_failures: Still modifying... (ID: funcStagingAlertBatchFailures, 20s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 20s elapsed)
aws_lambda_function.alert_batch_failures: Still modifying... (ID: funcStagingAlertBatchFailures, 30s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 30s elapsed)
aws_lambda_function.alert_batch_failures: Still modifying... (ID: funcStagingAlertBatchFailures, 40s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 40s elapsed)
aws_lambda_function.alert_batch_failures: Still modifying... (ID: funcStagingAlertBatchFailures, 50s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 50s elapsed)
aws_lambda_function.alert_batch_failures: Modifications complete after 57s (ID: funcStagingAlertBatchFailures)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 1m0s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 1m10s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 1m20s elapsed)
aws_lambda_function.alert_sfn_failures: Still modifying... (ID: funcStagingAlertStepFunctionsFailures, 1m30s elapsed)
aws_lambda_function.alert_sfn_failures: Modifications complete after 1m38s (ID: funcStagingAlertStepFunctionsFailures)

Apply complete! Resources: 0 added, 3 changed, 0 destroyed.
+ [[ -n '' ]]
+ popd
/usr/local/src
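
Because the parameter uses apply_method "immediate", new connections pick up
the setting without a reboot. It can be confirmed from any fresh session (a
sketch; the connection variables are placeholders):

# $DB_HOST, $DB_USER, and $DB_NAME stand in for the staging database details.
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c 'SHOW work_mem;'
# Should print 20000kB, confirming the parameter group change took effect.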

After

Screen Shot 2022-03-20 at 12 25 26 PM

Testing Instructions

  • Review the Demo above

Checklist

  • fixup! commits have been squashed
  • CI passes after rebase
  • CHANGELOG.md updated with summary of features or fixes, following Keep a Changelog guidelines

@TaiWilkin (Contributor) left a comment:

I've read through the notes and everything appears correct to me. We have already been running this in staging and it has been working as expected.

@TaiWilkin assigned jwalgran and unassigned TaiWilkin on Mar 20, 2022
@jwalgran force-pushed the feature/jcw/increase-db-work-mem branch from 97c1efe to 8fad3dc on March 20, 2022 21:08
@jwalgran merged commit fcd51ae into develop on Mar 20, 2022
@jwalgran deleted the feature/jcw/increase-db-work-mem branch on March 20, 2022 21:12