
[BUG] High load when updating internal users using PATCH #4008

Open
Jakob3xD opened this issue Feb 1, 2024 · 7 comments
Labels: bug (Something isn't working), performance (Make it fast!), triaged (Issues labeled as 'Triaged' have been reviewed and are deemed actionable)

Comments

Jakob3xD commented Feb 1, 2024

What is the bug?
We are currently experiencing the following behavior: one of our clusters goes to 100% CPU usage on all nodes when we send a PATCH request changing internal_users. It does not matter whether we add new users, delete them, or delete non-existing users. After the PATCH request is answered, the load stays high for a few minutes, from time to time causing node failures or rejected requests.

In other clusters we can also see a load increase after the PATCH request returns, but for a much shorter time frame.
It seems like this is caused by BCrypt; see the jstack output below.
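
For a rough sense of why BCrypt shows up so prominently, the cost of a single verification can be timed in Go with golang.org/x/crypto/bcrypt. This is only a sketch - the security plugin does its hashing in Java, and the cost factor 12 used here is an assumption:

package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	// Hash a password once with cost factor 12 (an assumption; the cost the
	// security plugin actually uses may differ).
	hash, err := bcrypt.GenerateFromPassword([]byte("some-password"), 12)
	if err != nil {
		panic(err)
	}

	// Time a single verification - roughly the work done whenever a request's
	// credentials have to be re-checked against the stored hash instead of
	// being served from a cache.
	start := time.Now()
	if err := bcrypt.CompareHashAndPassword(hash, []byte("some-password")); err != nil {
		panic(err)
	}
	fmt.Printf("one bcrypt verification took %s\n", time.Since(start))
}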

How can one reproduce the bug?
I could not reproduce the longer high-load period, only the shorter load spike (~20 s).
Steps to reproduce the behavior:

  1. Set up a 3-node OpenSearch cluster - 3x 8 vCPU, 30 GB RAM (OpenSearch has 16 GB RAM)
  2. Create some users that are allowed to index and request info (~300 users)
  3. Create load on the cluster with bulk and info requests by sending a lot of requests - see the Go file below.
  4. Send a PATCH request that, for example, deletes 40 users that do not exist (a sketch of such a request follows this list).
  5. Watch the load go up on all nodes.
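
For reference, a PATCH request like the one in step 4 can be sent with a small Go program such as the sketch below. It assumes the Security plugin's default REST API path (_plugins/_security/api/internalusers) and an admin user that is allowed to use that API; the host is taken from the client config further down, and the user names and credentials are placeholders:

package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// JSON Patch body; the user names are placeholders for users that do not exist.
	body := `[
		{"op": "remove", "path": "/does-not-exist-1"},
		{"op": "remove", "path": "/does-not-exist-2"}
	]`

	req, err := http.NewRequest(http.MethodPatch,
		"https://loc1-test-opensearch1-n1.example.com/_plugins/_security/api/internalusers",
		strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("admin", "<admin-password>") // placeholder credentials
	req.Header.Set("Content-Type", "application/json")

	// Test setup only: skip certificate verification for self-signed cluster certs.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}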

Go file used to create the base load:

package main

import (
	"context"
	"fmt"
	"os"
	"strings"
	"math/rand"
	"time"

	"github.com/opensearch-project/opensearch-go/v3"
	"github.com/opensearch-project/opensearch-go/v3/opensearchapi"
	"github.com/opensearch-project/opensearch-go/v3/opensearchutil"
)

func main() {
	if err := example(); err != nil {
		fmt.Printf("Error: %s\n", err)
		os.Exit(1)
	}
}

// example starts 300 goroutine pairs that continuously index documents and
// call the info endpoint, then blocks on the channel to keep them running.
func example() error {
	index := "some-index"
	c := make(chan int)

	for i := 1; i <= 300; i++ {
		go indexer(initClient(i), index, c)
		go info(initClient(i), c)
	}
	<-c
	return nil
}

// initClient builds a client that authenticates as one of the pre-created users.
func initClient(loop int) *opensearchapi.Client {
	client, err := opensearchapi.NewClient(
		opensearchapi.Config{
			Client: opensearch.Config{
				Addresses: []string{
					"https://loc1-test-opensearch1-n1.example.com",
					"https://loc1-test-opensearch1-n2.example.com",
					"https://loc1-test-opensearch1-n3.example.com",
				},
				Username: fmt.Sprintf("node-index-%d", loop),
				Password: "<password>", // placeholder password
			},
		},
	)
	if err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	return client
}

// info calls the cluster info endpoint at a randomized 500-1000 ms interval.
func info(client *opensearchapi.Client, c chan int) {
	ticker := time.NewTicker(time.Duration(500 + rand.Intn(500)) * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		client.Info(nil, nil)
	}
	c <- 0
}

// indexer pushes a small document into a bulk indexer every 100-200 ms.
func indexer(client *opensearchapi.Client, index string, c chan int) {

	bi, err := opensearchutil.NewBulkIndexer(opensearchutil.BulkIndexerConfig{
		Index:   index,
		Client:  client,
		FlushInterval: time.Duration(rand.Intn(500) + 500) * time.Millisecond,
	})
	if err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	ticker := time.NewTicker(time.Duration(rand.Intn(100) + 100) * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		err := bi.Add(context.Background(), opensearchutil.BulkIndexerItem{
			Action: "index",
			Body:   strings.NewReader(`{"title":"test", "doc": "90"}`),
		})
		if err != nil {
			fmt.Printf("Unexpected error: %s\n", err)
		}
	}
	if err := bi.Close(context.Background()); err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	c <- 0
}
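
To run the load generator, replace the password placeholder, then (assuming a recent Go toolchain) initialize a module and fetch the client, e.g. "go mod init loadgen" followed by "go get github.com/opensearch-project/opensearch-go/v3", and start it with "go run .".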

What is the expected behavior?
I would expect a lower load increase on all nodes when users change.

What is your host/environment?

  • OS: Ubuntu 22.04
  • OpenSearch Docker - 2.11.1
  • Plugins - all default plugins plus aiven/prometheus-exporter-plugin-for-opensearch, except opensearch-performance-analyzer.

Do you have any screenshots?
[screenshot: CPU usage graphs attached to the original issue]

Do you have any additional context?
jstack output captured during the high-load time frame:
opensearch-high-load-jstack.txt

@Jakob3xD added the bug and untriaged labels on Feb 1, 2024
@peternied added the performance label and removed the untriaged label on Feb 1, 2024
peternied (Member) commented:

@Jakob3xD Thanks for filing this issue - really appreciate the test configuration and the graphs on the issue.


Jakob3xD commented Feb 2, 2024

Maybe this is related: https://stackoverflow.com/questions/36471723/bcrypt-performance-deterioration

I have looked into it, but the entropy_avail of /dev/random does not decrease. Also, the short bash script that cats /dev/random does not have any effect on the system's load. Running pv /dev/random > /dev/null shows a constant output of ~180 MiB/s, unaffected by the bash script from the answer. Therefore I assume this is not related to my issue?

scrawfor99 (Collaborator) commented:

[Triage] Hi @Jakob3xD, thank you for filing this issue. We can go ahead and mark this as triaged, with a closure criterion of either correcting the CPU usage, if it is directly in our control, or at least pointing to the correct dependency or issue that is causing this behavior.

@scrawfor99 added the triaged label on Feb 5, 2024

Jakob3xD commented Feb 6, 2024

"We can go ahead and mark this as triaged, with a closure criterion of either correcting the CPU usage, if it is directly in our control, or at least pointing to the correct dependency or issue that is causing this behavior."

I am not sure I understand your sentence correctly. Can you please rephrase it for me?

I opened the issue as I am not familiar with Java and have no further ideas on how to debug this or add more useful information.

scrawfor99 (Collaborator) commented:

Hi @Jakob3xD, the comment I left was part of the weekly triaging process the contributors go through. You can read more about it in the TRIAGING markdown file.

It is not clear whether OpenSearch or something we are dependent on is causing the CPU use. We may not be able to correct the issue if it is from a dependency, but we can at least point to the root issue if that is the case.


willyborankin commented Feb 19, 2024

I reproduced it on a 1-node cluster, and the issue, as @peternied mentioned, is related to BCrypt.
Here is the result without password validation:
[screenshot: profiling result without password validation]

Here is the result with password validation:
[screenshot: profiling result with password validation]
