Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When the context is cancelled the node is marked dead #484

Closed
AndreKR opened this issue Mar 17, 2017 · 24 comments
Closed

When the context is cancelled the node is marked dead #484

AndreKR opened this issue Mar 17, 2017 · 24 comments
Assignees

Comments

@AndreKR
Copy link

AndreKR commented Mar 17, 2017

Version

elastic.v5 (for Elasticsearch 5.x)

How to reproduce:

package main

import (
	"context"
	"gopkg.in/olivere/elastic.v5"
	"log"
	"os"
	"time"
)

func main() {

	var err error

	client, err := elastic.NewClient(
		elastic.SetURL("https://httpbin.org/delay/3?"), // every request will take about 3 seconds
		elastic.SetHealthcheck(false),
		elastic.SetSniff(false),
		elastic.SetErrorLog(log.New(os.Stderr, "", log.LstdFlags)),
		elastic.SetInfoLog(log.New(os.Stdout, "", log.LstdFlags)),
	)
	if err != nil {
		log.Fatal(err)
	}

	ctx, _ := context.WithTimeout(context.Background(), 1*time.Second) // requests will time out after 1 second

	log.Println("Running request")

	_, err = client.Get().Index("whatever").Id("1").Do(ctx)

	if err != nil {
		log.Println("Error: " + err.Error())
	}

	log.Println("Running second request")

	_, err = client.Get().Index("whatever").Id("1").Do(ctx)

	if err != nil {
		log.Println("Error: " + err.Error())
	}

}

Actual

2017/03/17 08:02:33 Running request
2017/03/17 08:02:34 elastic: https://httpbin.org/delay/3? is dead
2017/03/17 08:02:34 Error: context deadline exceeded
2017/03/17 08:02:34 Running second request
2017/03/17 08:02:34 elastic: all 1 nodes marked as dead; resurrecting them to prevent deadlock
2017/03/17 08:02:34 Error: no Elasticsearch node available

Expected

Something like (I edited that "log" myself):

2017/03/17 08:02:33 Running request
2017/03/17 08:02:34 Error: context deadline exceeded
2017/03/17 08:02:34 Running second request
2017/03/17 08:02:37 GET https://httpbin.org/delay/3?/whatever/_all/1 [status:200, request:3.500s]
@olivere
Copy link
Owner

olivere commented Mar 17, 2017

Thanks for the helpful issue report. I will look into it asap.

@olivere olivere self-assigned this Mar 17, 2017
olivere added a commit that referenced this issue Mar 17, 2017
@olivere
Copy link
Owner

olivere commented Mar 17, 2017

@AndreKR Can you review the above code, please? I think it's the correct fix, but maybe I've been missing something.

$ ./issue-484
2017/03/17 09:12:27 Running request
2017/03/17 09:12:28 Error: context deadline exceeded
2017/03/17 09:12:28 Running second request
2017/03/17 09:12:28 Error: context deadline exceeded

@olivere
Copy link
Owner

olivere commented Mar 17, 2017

The second result in your example should also fail because you reused the canceled/deadlined context from request 1. If you specify a new context with a timeout of 4s, the output would be:

$ ./issue-484
2017/03/17 09:16:04 Running request
2017/03/17 09:16:05 Error: context deadline exceeded
2017/03/17 09:16:05 Running second request
2017/03/17 09:16:09 client.go:727: GET https://httpbin.org/delay/3?/whatever/_all/1 [status:200, request:3.498s]

@AndreKR
Copy link
Author

AndreKR commented Mar 22, 2017

Yep, works like a charm. (With the fixed test code and in my real application.)

From looking at the code it seems that if an error were to happen during a retry it would still mark the node dead, but that might actually be a sensible thing to do.

olivere added a commit that referenced this issue Mar 26, 2017
@olivere
Copy link
Owner

olivere commented Mar 26, 2017

Will be fixed in 3.0.67 and 5.0.31.

@olivere olivere closed this as completed Mar 26, 2017
@AndreKR
Copy link
Author

AndreKR commented Aug 1, 2017

This issue has returned. I'm not sure why it worked back then but maybe we both were using Go 1.7 at that point?

Anyway, the problem is that http.Client no longer returns context.Canceled when its context was canceled. It now returns an *url.Error, so that this doesn't work anymore.

Fortunately the *url.Error contains the original error in Err, so the problem can be solved again like this:

res, err := c.c.Do((*http.Request)(req).WithContext(ctx))
if uerr, ok := err.(*url.Error); ok {
	if uerr.Err == context.Canceled || uerr.Err == context.DeadlineExceeded {
		// Proceed, but don't mark the node as dead
		return nil, err
	}
}

@olivere
Copy link
Owner

olivere commented Aug 1, 2017

The problem seems to occur on e.g. when redirect fails. I will add another check for that case.

I've already used 1.8 back then.

@olivere olivere reopened this Aug 1, 2017
olivere added a commit that referenced this issue Aug 1, 2017
In certain cases, the returned error on a canceled or deadlined
request is not `context.Canceled` or `context.DeadlineExceeded`,
but a `*url.Error` whose `Error` field carries the above context errors.

In the standard library, there is a specific test for redirects that
checks this case (https://golang.org/src/net/http/client_test.go#L329),
so we fix this in the same way in `PerformRequest`.

See #484
@olivere
Copy link
Owner

olivere commented Aug 1, 2017

The origin of this seems to be 5 years ago :-)

@olivere
Copy link
Owner

olivere commented Aug 1, 2017

Just pushed 5.0.44.

@olivere olivere closed this as completed Aug 1, 2017
@AndreKR
Copy link
Author

AndreKR commented Aug 1, 2017

Fantastic. :)

For the record, this is not related to redirects, try this:

client := http.Client{}
ctx, _ := context.WithTimeout(context.Background(), 1*time.Second)   // requests will time out after 1 second
req, _ := http.NewRequest("GET", "https://httpbin.org/delay/3", nil) // every request will take about 3 seconds
_, err := client.Do(req.WithContext(ctx))
fmt.Println(err)
fmt.Println(reflect.TypeOf(err).String())
fmt.Println(err == context.Canceled)
fmt.Println(err == context.DeadlineExceeded)

Output:

Get https://httpbin.org/delay/3: context deadline exceeded
*url.Error
false
false

dylannz pushed a commit to HomesNZ/elastic that referenced this issue Aug 7, 2017
@wedneyyuri
Copy link
Contributor

wedneyyuri commented Jun 19, 2018

@olivere it seems that we are facing the same problem on elastic.v6.

Version
elastic.v6 (for Elasticsearch 6.x)

How to reproduce:

package main

import (
	"context"
	"log"
	"os"
	"time"

	"gopkg.in/olivere/elastic.v6"
)

func main() {
	var err error
	client, err := elastic.NewClient(
		elastic.SetURL(
			"http://localhost:9200",
		),
		elastic.SetSniff(false),
		elastic.SetErrorLog(log.New(os.Stderr, "", log.LstdFlags)),
		elastic.SetInfoLog(log.New(os.Stdout, "", log.LstdFlags)),
	)
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 50; i++ {
		func(i int) {
			log.Println("Running request ", i)

			ctx, cancelFunc := context.WithTimeout(context.Background(), 1*time.Millisecond)
			defer cancelFunc()

			_, err := client.Get().Index("index-name").Type("_doc").Id("35642796").Do(ctx)
			if err != nil {
				log.Println("Err: " + err.Error())
			}

			log.Println("Finished request ", i)
			log.Print("\n\n\n")
		}(i)
	}
}
Output log

**Log:**

018/06/19 12:01:25 Running request  0
2018/06/19 12:01:25 Err: Get http://localhost:9200/index-name/_doc/35642796: context deadline exceeded
2018/06/19 12:01:25 Finished request  0
2018/06/19 12:01:25


2018/06/19 12:01:25 Running request  1
2018/06/19 12:01:25 Err: Get http://localhost:9200/index-name/_doc/35642796: context deadline exceeded
2018/06/19 12:01:25 Finished request  1
2018/06/19 12:01:25


2018/06/19 12:01:25 Running request  2
2018/06/19 12:01:25 elastic: http://localhost:9200 is dead
2018/06/19 12:01:25 Err: Get http://localhost:9200/index-name/_doc/35642796: dial tcp: lookup internal-shop-alpha-elastic-blue-node-1202787545.sa-east-1.elb.amazonaws.com on 10.13.31.180:53: dial udp 10.13.31.180:53: i/o timeout
2018/06/19 12:01:25 Finished request  2
2018/06/19 12:01:25


2018/06/19 12:01:25 Running request  3
2018/06/19 12:01:25 elastic: all 1 nodes marked as dead; resurrecting them to prevent deadlock
2018/06/19 12:01:25 Err: no available connection: no Elasticsearch node available
2018/06/19 12:01:25 Finished request  3
2018/06/19 12:01:25


2018/06/19 12:01:25 Running request  4
2018/06/19 12:01:25 Err: Get http://localhost:9200/index-name/_doc/35642796: context deadline exceeded
2018/06/19 12:01:25 Finished request  4
2018/06/19 12:01:25

@olivere
Copy link
Owner

olivere commented Jun 19, 2018

I'm on vacation now, but re-opening to look into it when I'm back.

@olivere olivere reopened this Jun 19, 2018
@wedneyyuri
Copy link
Contributor

wedneyyuri commented Jun 22, 2018

Thank you, disabling HealthCheck seems like the better solution for now.

The code below will not produce this error:

client, err := elastic.NewClient(
	elastic.SetURL(
		"http://localhost:9200",
	),
	elastic.SetSniff(false),
	elastic.SetHealthcheck(false),
	elastic.SetErrorLog(log.New(os.Stderr, "", log.LstdFlags)),
	elastic.SetInfoLog(log.New(os.Stdout, "", log.LstdFlags)),
)

@olivere
Copy link
Owner

olivere commented Jul 1, 2018

@wedneyyuri So the problem seems to be that the healthcheck context runs into a timeout, hence the context is canceled. If the healthcheck runs into a timeout, what is it going to do? I think it's correct to mark the connection as dead if the healthcheck doesn't return in time. Am I missing something?

Notice that even if all nodes are dead, PerformRequest will resurrect a connection to complete the request, even if marked as dead.

@wedneyyuri
Copy link
Contributor

@olivere Seems like PerformRequest is resurrecting the connection after returning the error Err: no available connection: no Elasticsearch node available.

I guess it's causing safe requests to be cancelled before they hit the server.

olivere added a commit that referenced this issue Jul 2, 2018
When the retrier fails due to a context timeout, don't mark the node as
dead.

This is a possible fix to #484.
@olivere
Copy link
Owner

olivere commented Jul 2, 2018

@wedneyyuri Can you try the context-canceled.issue-484 branch to see if that fixes the problem?

@wedneyyuri
Copy link
Contributor

Thanks @leezhm and @olivere its working fine on this branch.

@olivere
Copy link
Owner

olivere commented Jul 11, 2018

Will merge this in the next release then. Thanks for your support.

olivere added a commit that referenced this issue Jul 11, 2018
This commit ignores *url.Error errors that are marked as temporary. It
also removes the context timeout checks for retrier added earlier
because that cannot happen in that code path.

Fix #484
@arunplayground
Copy link

I still get the following error no available connection: no Elasticsearch node available

I am running my code in a multi threaded env (10 goroutines). Each go routine has an instance of a client. In each goroutine, I do get documents and insert documents using context.Background().

If the code is fixed, i am not sure where i am at fault. I have disabled sniffing and healthcheck (elastic.SetSniff(false), elastic.SetHealthcheck(false))

Any insights would help

@AndreKR
Copy link
Author

AndreKR commented Jul 16, 2018

If you get no available connection: no Elasticsearch node available you should have gotten another error before this one which should have told you why all nodes have been marked dead.

@arunplayground
Copy link

Get http://127.0.0.1:9200/instruments/doc/IDBBGLOBAL138357_IDBBUNIQUE138357: dial tcp 127.0.0.1:9200:connect: can't assign requested address.

This is the message i get before the "no available connection" error.

Also i changed my code, so that i use a single client for all by goroutines, instead of individual clients per goroutine.

I dont get the error message all the time, but i do get it off and on.
I have disabled sniffing and healthcheck (elastic.SetSniff(false), elastic.SetHealthcheck(false))

@olivere
Copy link
Owner

olivere commented Jul 16, 2018

@arunplayground When seeing connect: can't assign requested address. you are probably exceeding the number of local socket connections permitted by your OS. See e.g. golang/go#16012.

@arunplayground
Copy link

You are absolutely right. There were rouge connections alive to Redis. I implemented a proper pooling of connections to Redis, and the client connections to elasticsearch stopped complaining about the dial tcp errors. I have to look into how to increase the number of available sockets to a user in Mac OS

@limoli
Copy link

limoli commented Dec 3, 2018

Thank you, disabling HealthCheck seems like the better solution for now.

The code below will not produce this error:

client, err := elastic.NewClient(
	elastic.SetURL(
		"http://localhost:9200",
	),
	elastic.SetSniff(false),
	elastic.SetHealthcheck(false),
	elastic.SetErrorLog(log.New(os.Stderr, "", log.LstdFlags)),
	elastic.SetInfoLog(log.New(os.Stdout, "", log.LstdFlags)),
)

I am still getting the same problem/behaviour. I can make it working just disabling health check. I have imported the library using dep and this is the current version:

[[projects]]
  digest = "1:995fe8c9729e94587361e836896824337bbceab8030f698a40552b1a63bf2c59"
  name = "github.com/olivere/elastic"
  packages = [
    ".",
    "config",
    "uritemplates",
  ]
  pruneopts = "UT"
  revision = "1619150b007041b6dba8aa447f0e2d151cc2b4c5"
  version = "v6.2.14"

My golang version is 1.11.

Any advices @olivere ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants