
[BUG] High load when updating internal users using PATCH #4008

Open
Jakob3xD opened this issue Feb 1, 2024 · 7 comments
Labels: bug (Something isn't working), performance (Make it fast!), triaged (Issues labeled as 'Triaged' have been reviewed and are deemed actionable)

Comments

Jakob3xD commented Feb 1, 2024

What is the bug?
We are currently experiencing the following behavior: one of our clusters goes to 100% CPU usage on all nodes when we send a PATCH request changing internal_users. It does not matter whether we add new users, delete them, or delete non-existing users. After the PATCH request is answered, the load stays high for a few minutes, from time to time causing node failures or rejected requests.

In other clusters we can also see a load increase after the PATCH request returns, but for a much shorter time frame.
It seems like this is caused by BCrypt; see the jstack output below.
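
For a rough sense of why BCrypt shows up so prominently, the cost of a single verification can be timed in Go with golang.org/x/crypto/bcrypt. This is only a sketch - the security plugin does its hashing in Java, and the cost factor 12 used here is an assumption:

package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	// Hash a password once with cost factor 12 (an assumption; the cost the
	// security plugin actually uses may differ).
	hash, err := bcrypt.GenerateFromPassword([]byte("some-password"), 12)
	if err != nil {
		panic(err)
	}

	// Time a single verification - roughly the work done whenever a request's
	// credentials have to be re-checked against the stored hash instead of
	// being served from a cache.
	start := time.Now()
	if err := bcrypt.CompareHashAndPassword(hash, []byte("some-password")); err != nil {
		panic(err)
	}
	fmt.Printf("one bcrypt verification took %s\n", time.Since(start))
}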

How can one reproduce the bug?
I could not reproduce the longer high-load period, only the shorter load spike (~20 s).
Steps to reproduce the behavior:

  1. Set up a 3-node OpenSearch cluster - 3x 8 vCPU, 30 GB RAM (OpenSearch has 16 GB RAM)
  2. Create some users that are allowed to index and request info (~300 users)
  3. Create load on the cluster with bulk and info requests by sending a lot of requests - see the Go file below.
  4. Send a PATCH request that, for example, deletes 40 users that do not exist (a sketch of such a request follows this list).
  5. Watch the load go up on all nodes.
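
For reference, a PATCH request like the one in step 4 can be sent with a small Go program such as the sketch below. It assumes the Security plugin's default REST API path (_plugins/_security/api/internalusers) and an admin user that is allowed to use that API; the host is taken from the client config further down, and the user names and credentials are placeholders:

package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// JSON Patch body; the user names are placeholders for users that do not exist.
	body := `[
		{"op": "remove", "path": "/does-not-exist-1"},
		{"op": "remove", "path": "/does-not-exist-2"}
	]`

	req, err := http.NewRequest(http.MethodPatch,
		"https://loc1-test-opensearch1-n1.example.com/_plugins/_security/api/internalusers",
		strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("admin", "<admin-password>") // placeholder credentials
	req.Header.Set("Content-Type", "application/json")

	// Test setup only: skip certificate verification for self-signed cluster certs.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}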

Go file used to create the base load:

package main

import (
	"context"
	"fmt"
	"os"
	"strings"
	"math/rand"
	"time"

	"github.com/opensearch-project/opensearch-go/v3"
	"github.com/opensearch-project/opensearch-go/v3/opensearchapi"
	"github.com/opensearch-project/opensearch-go/v3/opensearchutil"
)

func main() {
	if err := example(); err != nil {
		fmt.Printf("Error: %s\n", err)
		os.Exit(1)
	}
}

// example starts 300 goroutine pairs that continuously index documents and
// call the info endpoint, then blocks on the channel to keep them running.
func example() error {
	index := "some-index"
	c := make(chan int)

	for i := 1; i <= 300; i++ {
		go indexer(initClient(i), index, c)
		go info(initClient(i), c)
	}
	<-c
	return nil
}

// initClient builds a client that authenticates as one of the pre-created users.
func initClient(loop int) *opensearchapi.Client {
	client, err := opensearchapi.NewClient(
		opensearchapi.Config{
			Client: opensearch.Config{
				Addresses: []string{
					"https://loc1-test-opensearch1-n1.example.com",
					"https://loc1-test-opensearch1-n2.example.com",
					"https://loc1-test-opensearch1-n3.example.com",
				},
				Username: fmt.Sprintf("node-index-%d", loop),
				Password: "<password>", // placeholder password
			},
		},
	)
	if err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	return client
}

// info calls the cluster info endpoint at a randomized 500-1000 ms interval.
func info(client *opensearchapi.Client, c chan int) {
	ticker := time.NewTicker(time.Duration(500 + rand.Intn(500)) * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		client.Info(nil, nil)
	}
	c <- 0
}

// indexer pushes a small document into a bulk indexer every 100-200 ms.
func indexer(client *opensearchapi.Client, index string, c chan int) {

	bi, err := opensearchutil.NewBulkIndexer(opensearchutil.BulkIndexerConfig{
		Index:   index,
		Client:  client,
		FlushInterval: time.Duration(rand.Intn(500) + 500) * time.Millisecond,
	})
	if err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	ticker := time.NewTicker(time.Duration(rand.Intn(100) + 100) * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		err := bi.Add(context.Background(), opensearchutil.BulkIndexerItem{
			Action: "index",
			Body:   strings.NewReader(`{"title":"test", "doc": "90"}`),
		})
		if err != nil {
			fmt.Printf("Unexpected error: %s\n", err)
		}
	}
	if err := bi.Close(context.Background()); err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	c <- 0
}
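
To run the load generator, replace the password placeholder, then (assuming a recent Go toolchain) initialize a module and fetch the client, e.g. "go mod init loadgen" followed by "go get github.com/opensearch-project/opensearch-go/v3", and start it with "go run .".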

What is the expected behavior?
I would expect a lower load increase on all nodes when users change.

What is your host/environment?

  • OS: Ubuntu 22.04
  • OpenSearch Docker - 2.11.1
  • Plugins - all default plugins plus aiven/prometheus-exporter-plugin-for-opensearch, except opensearch-performance-analyzer.

Do you have any screenshots?
[screenshot: CPU usage graphs attached to the original issue]

Do you have any additional context?
jstack output captured during the high-load time frame:
opensearch-high-load-jstack.txt

@Jakob3xD added the bug and untriaged labels on Feb 1, 2024
@peternied added the performance label and removed the untriaged label on Feb 1, 2024
peternied (Member) commented:

@Jakob3xD Thanks for filing this issue - really appreciate the test configuration and the graphs on the issue.


Jakob3xD commented Feb 2, 2024

Maybe this is related: https://stackoverflow.com/questions/36471723/bcrypt-performance-deterioration

I have looked into it, but the entropy_avail of /dev/random does not decrease. Also, the short bash script that cats /dev/random does not have any effect on the system's load. Running pv /dev/random > /dev/null shows a constant output of ~180 MiB/s, unaffected by the bash script from the answer. Therefore I assume this is not related to my issue?

scrawfor99 (Collaborator) commented:

[Triage] Hi @Jakob3xD, thank you for filing this issue. We can go ahead and mark this as triaged, with a closure criterion of either correcting the CPU usage, if it is directly in our control, or at least pointing to the correct dependency or issue that is causing this behavior.

@scrawfor99 added the triaged label on Feb 5, 2024

Jakob3xD commented Feb 6, 2024

"We can go ahead and mark this as triaged, with a closure criterion of either correcting the CPU usage, if it is directly in our control, or at least pointing to the correct dependency or issue that is causing this behavior."

I am not sure I understand your sentence correctly. Can you please rephrase it for me?

I opened the issue as I am not familiar with Java and have no further ideas on how to debug this or add more useful information.

scrawfor99 (Collaborator) commented:

Hi @Jakob3xD, the comment I left was part of the weekly triaging process the contributors go through. You can read more about it in the TRIAGING markdown file.

It is not clear whether OpenSearch or something we are dependent on is causing the CPU use. We may not be able to correct the issue if it is from a dependency, but we can at least point to the root issue if that is the case.


willyborankin commented Feb 19, 2024

I reproduced it on a 1-node cluster, and the issue, as @peternied mentioned, is related to BCrypt.
Here is the result without password validation:
[screenshot: profiling result without password validation]

Here is the result with password validation:
[screenshot: profiling result with password validation]
