When the context is cancelled the node is marked dead #484

AndreKR · 2017-03-17T07:09:24Z

Version

elastic.v5 (for Elasticsearch 5.x)

How to reproduce:

package main

import (
	"context"
	"gopkg.in/olivere/elastic.v5"
	"log"
	"os"
	"time"
)

func main() {

	var err error

	client, err := elastic.NewClient(
		elastic.SetURL("https://httpbin.org/delay/3?"), // every request will take about 3 seconds
		elastic.SetHealthcheck(false),
		elastic.SetSniff(false),
		elastic.SetErrorLog(log.New(os.Stderr, "", log.LstdFlags)),
		elastic.SetInfoLog(log.New(os.Stdout, "", log.LstdFlags)),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Requests will time out after 1 second; the same ctx is reused for both requests below.
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	log.Println("Running request")

	_, err = client.Get().Index("whatever").Id("1").Do(ctx)

	if err != nil {
		log.Println("Error: " + err.Error())
	}

	log.Println("Running second request")

	_, err = client.Get().Index("whatever").Id("1").Do(ctx)

	if err != nil {
		log.Println("Error: " + err.Error())
	}

}

Actual

2017/03/17 08:02:33 Running request
2017/03/17 08:02:34 elastic: https://httpbin.org/delay/3? is dead
2017/03/17 08:02:34 Error: context deadline exceeded
2017/03/17 08:02:34 Running second request
2017/03/17 08:02:34 elastic: all 1 nodes marked as dead; resurrecting them to prevent deadlock
2017/03/17 08:02:34 Error: no Elasticsearch node available

Expected

Something like (I edited that "log" myself):

2017/03/17 08:02:33 Running request
2017/03/17 08:02:34 Error: context deadline exceeded
2017/03/17 08:02:34 Running second request
2017/03/17 08:02:37 GET https://httpbin.org/delay/3?/whatever/_all/1 [status:200, request:3.500s]

olivere · 2017-03-17T07:59:23Z

Thanks for the helpful issue report. I will look into it asap.

See #484

olivere · 2017-03-17T08:14:51Z

@AndreKR Can you review the above code, please? I think it's the correct fix, but maybe I've been missing something.

$ ./issue-484
2017/03/17 09:12:27 Running request
2017/03/17 09:12:28 Error: context deadline exceeded
2017/03/17 09:12:28 Running second request
2017/03/17 09:12:28 Error: context deadline exceeded

olivere · 2017-03-17T08:17:11Z

The second request in your example should also fail because it reuses the context from request 1, whose deadline has already expired. If you create a new context with a timeout of 4s, the output is:

$ ./issue-484
2017/03/17 09:16:04 Running request
2017/03/17 09:16:05 Error: context deadline exceeded
2017/03/17 09:16:05 Running second request
2017/03/17 09:16:09 client.go:727: GET https://httpbin.org/delay/3?/whatever/_all/1 [status:200, request:3.498s]

AndreKR · 2017-03-22T06:21:09Z

Yep, works like a charm. (With the fixed test code and in my real application.)

From looking at the code it seems that if an error were to happen during a retry it would still mark the node dead, but that might actually be a sensible thing to do.

See #484

olivere · 2017-03-26T09:27:59Z

Will be fixed in 3.0.67 and 5.0.31.

AndreKR · 2017-08-01T05:41:46Z

This issue has returned. I'm not sure why it worked back then but maybe we both were using Go 1.7 at that point?

Anyway, the problem is that http.Client no longer returns context.Canceled when its context was canceled. It now returns a *url.Error, so a direct comparison against context.Canceled no longer matches.

Fortunately the *url.Error contains the original error in Err, so the problem can be solved again like this:

res, err := c.c.Do((*http.Request)(req).WithContext(ctx))
if uerr, ok := err.(*url.Error); ok {
	if uerr.Err == context.Canceled || uerr.Err == context.DeadlineExceeded {
		// Proceed, but don't mark the node as dead
		return nil, err
	}
}

olivere · 2017-08-01T06:31:23Z

The problem seems to occur, for example, when a redirect fails. I will add another check for that case.

I was already using Go 1.8 back then.

In certain cases, the returned error on a canceled or deadlined request is not `context.Canceled` or `context.DeadlineExceeded`, but a `*url.Error` whose `Err` field carries the above context errors. In the standard library, there is a specific test for redirects that checks this case (https://golang.org/src/net/http/client_test.go#L329), so we fix this in the same way in `PerformRequest`. See #484

olivere · 2017-08-01T06:44:54Z

The origin of this seems to be 5 years ago :-)

olivere · 2017-08-01T06:49:14Z

Just pushed 5.0.44.

AndreKR · 2017-08-01T22:32:34Z

Fantastic. :)

For the record, this is not related to redirects, try this:

client := http.Client{}
ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second) // requests will time out after 1 second
defer cancel()
req, _ := http.NewRequest("GET", "https://httpbin.org/delay/3", nil) // every request will take about 3 seconds
_, err := client.Do(req.WithContext(ctx))
fmt.Println(err)
fmt.Println(reflect.TypeOf(err).String())
fmt.Println(err == context.Canceled)
fmt.Println(err == context.DeadlineExceeded)

Output:

Get https://httpbin.org/delay/3: context deadline exceeded
*url.Error
false
false

See olivere#484

wedneyyuri · 2018-06-19T20:14:33Z

@olivere it seems that we are facing the same problem on elastic.v6.

Version
elastic.v6 (for Elasticsearch 6.x)

How to reproduce:

package main

import (
	"context"
	"log"
	"os"
	"time"

	"gopkg.in/olivere/elastic.v6"
)

func main() {
	var err error
	client, err := elastic.NewClient(
		elastic.SetURL(
			"http://localhost:9200",
		),
		elastic.SetSniff(false),
		elastic.SetErrorLog(log.New(os.Stderr, "", log.LstdFlags)),
		elastic.SetInfoLog(log.New(os.Stdout, "", log.LstdFlags)),
	)
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 50; i++ {
		func(i int) {
			log.Println("Running request ", i)

			ctx, cancelFunc := context.WithTimeout(context.Background(), 1*time.Millisecond)
			defer cancelFunc()

			_, err := client.Get().Index("index-name").Type("_doc").Id("35642796").Do(ctx)
			if err != nil {
				log.Println("Err: " + err.Error())
			}

			log.Println("Finished request ", i)
			log.Print("\n\n\n")
		}(i)
	}
}

Output log:

2018/06/19 12:01:25 Running request  0
2018/06/19 12:01:25 Err: Get http://localhost:9200/index-name/_doc/35642796: context deadline exceeded
2018/06/19 12:01:25 Finished request  0
2018/06/19 12:01:25


2018/06/19 12:01:25 Running request  1
2018/06/19 12:01:25 Err: Get http://localhost:9200/index-name/_doc/35642796: context deadline exceeded
2018/06/19 12:01:25 Finished request  1
2018/06/19 12:01:25


2018/06/19 12:01:25 Running request  2
2018/06/19 12:01:25 elastic: http://localhost:9200 is dead
2018/06/19 12:01:25 Err: Get http://localhost:9200/index-name/_doc/35642796: dial tcp: lookup internal-shop-alpha-elastic-blue-node-1202787545.sa-east-1.elb.amazonaws.com on 10.13.31.180:53: dial udp 10.13.31.180:53: i/o timeout
2018/06/19 12:01:25 Finished request  2
2018/06/19 12:01:25


2018/06/19 12:01:25 Running request  3
2018/06/19 12:01:25 elastic: all 1 nodes marked as dead; resurrecting them to prevent deadlock
2018/06/19 12:01:25 Err: no available connection: no Elasticsearch node available
2018/06/19 12:01:25 Finished request  3
2018/06/19 12:01:25


2018/06/19 12:01:25 Running request  4
2018/06/19 12:01:25 Err: Get http://localhost:9200/index-name/_doc/35642796: context deadline exceeded
2018/06/19 12:01:25 Finished request  4
2018/06/19 12:01:25

olivere · 2018-06-19T20:35:49Z

I'm on vacation now, but re-opening to look into it when I'm back.

wedneyyuri · 2018-06-22T20:07:37Z

Thank you, disabling HealthCheck seems like the better solution for now.

The code below will not produce this error:

client, err := elastic.NewClient(
	elastic.SetURL(
		"http://localhost:9200",
	),
	elastic.SetSniff(false),
	elastic.SetHealthcheck(false),
	elastic.SetErrorLog(log.New(os.Stderr, "", log.LstdFlags)),
	elastic.SetInfoLog(log.New(os.Stdout, "", log.LstdFlags)),
)

olivere · 2018-07-01T09:41:54Z

@wedneyyuri So the problem seems to be that the healthcheck context runs into a timeout, hence the context is canceled. If the healthcheck runs into a timeout, what is it going to do? I think it's correct to mark the connection as dead if the healthcheck doesn't return in time. Am I missing something?

Notice that even if all nodes are dead, PerformRequest will resurrect a connection to complete the request, even if marked as dead.

wedneyyuri · 2018-07-01T21:35:39Z

@olivere Seems like PerformRequest is resurrecting the connection after returning the error Err: no available connection: no Elasticsearch node available.

I guess it's causing safe requests to be cancelled before they hit the server.

When the retrier fails due to a context timeout, don't mark the node as dead. This is a possible fix to #484.

olivere · 2018-07-02T09:55:30Z

@wedneyyuri Can you try the context-canceled.issue-484 branch to see if that fixes the problem?

wedneyyuri · 2018-07-10T14:24:29Z

Thanks @leezhm and @olivere, it's working fine on this branch.

olivere · 2018-07-11T11:13:52Z

Will merge this in the next release then. Thanks for your support.

This commit ignores *url.Error errors that are marked as temporary. It also removes the context timeout checks for retrier added earlier because that cannot happen in that code path. Fix #484

arunplayground · 2018-07-16T06:53:42Z

I still get the following error: no available connection: no Elasticsearch node available

I am running my code in a concurrent environment (10 goroutines). Each goroutine has its own client instance. In each goroutine, I get and insert documents using context.Background().

If the code is fixed, I am not sure where I am at fault. I have disabled sniffing and healthcheck (elastic.SetSniff(false), elastic.SetHealthcheck(false)).

Any insights would help.

AndreKR · 2018-07-16T07:05:43Z

If you get "no available connection: no Elasticsearch node available", you should have gotten another error before it that tells you why all nodes were marked dead.

arunplayground · 2018-07-16T08:02:58Z

Get http://127.0.0.1:9200/instruments/doc/IDBBGLOBAL138357_IDBBUNIQUE138357: dial tcp 127.0.0.1:9200:connect: can't assign requested address.

This is the message I get before the "no available connection" error.

Also, I changed my code so that I use a single client for all my goroutines, instead of an individual client per goroutine.

I don't get the error message all the time, but I do get it off and on.
I have disabled sniffing and healthcheck (elastic.SetSniff(false), elastic.SetHealthcheck(false)).

olivere · 2018-07-16T12:46:22Z

@arunplayground When you see "connect: can't assign requested address", you are probably exceeding the number of local socket connections permitted by your OS. See e.g. golang/go#16012.

arunplayground · 2018-07-16T14:15:30Z

You are absolutely right. There were rogue connections alive to Redis. I implemented proper connection pooling for Redis, and the client connections to Elasticsearch stopped complaining about the dial tcp errors. I still have to look into how to increase the number of sockets available to a user on macOS.

limoli · 2018-12-03T16:13:44Z

Thank you, disabling HealthCheck seems like the better solution for now.

The code below will not produce this error:

client, err := elastic.NewClient(
	elastic.SetURL(
		"http://localhost:9200",
	),
	elastic.SetSniff(false),
	elastic.SetHealthcheck(false),
	elastic.SetErrorLog(log.New(os.Stderr, "", log.LstdFlags)),
	elastic.SetInfoLog(log.New(os.Stdout, "", log.LstdFlags)),
)

I am still seeing the same behaviour. I can only make it work by disabling the health check. I imported the library using dep, and this is the current version:

[[projects]]
  digest = "1:995fe8c9729e94587361e836896824337bbceab8030f698a40552b1a63bf2c59"
  name = "github.com/olivere/elastic"
  packages = [
    ".",
    "config",
    "uritemplates",
  ]
  pruneopts = "UT"
  revision = "1619150b007041b6dba8aa447f0e2d151cc2b4c5"
  version = "v6.2.14"

My Go version is 1.11.

Any advice, @olivere?

olivere self-assigned this Mar 17, 2017

olivere added a commit that referenced this issue Mar 17, 2017

Don't mark nodes as dead on canceled context

8f58b19

See #484

olivere added a commit that referenced this issue Mar 26, 2017

Don't mark nodes as dead on canceled context

0753243

See #484

olivere closed this as completed Mar 26, 2017

olivere reopened this Aug 1, 2017

olivere closed this as completed Aug 1, 2017

dylannz pushed a commit to HomesNZ/elastic that referenced this issue Aug 7, 2017

Don't mark nodes as dead on canceled context

446b8ee

See olivere#484

olivere reopened this Jun 19, 2018

olivere added a commit that referenced this issue Jul 2, 2018

Do not mark nodes as dead when context is canceled

d1ec6a6

When the retrier fails due to a context timeout, don't mark the node as dead. This is a possible fix to #484.

olivere closed this as completed in 525a37b Jul 11, 2018

yudidi mentioned this issue Dec 21, 2020

Just want the final solution for some previous similar issues(*nodes marked as dead) #1449

Closed

Suraiya-Hameed mentioned this issue Jan 28, 2021

Disable healthcheck on es client tigera/operator#1100

Merged

5 tasks
