Update publisher to re-connect #33
Conversation
The publisher will now re-connect when NATS Streaming becomes unavailable. Tested within the gateway code on Docker Swarm. Thanks to @vosmith for initiating this work. Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>
```diff
@@ -50,7 +64,26 @@ func (q *NatsQueue) Queue(req *queue.Request) error {
 		log.Println(err)
 	}
 
-	err = q.nc.Publish(q.Topic, out)
+	err = q.stanConn.Publish(q.Topic, out)
```
@kozlovic I find that this code blocks indefinitely if I scale NATS Streaming to zero replicas. It would be better to check whether the stanConn is ready/healthy before submitting the work, or potentially to go into a back-off loop.
`Publish()` is a synchronous call that waits for the ack back from the server. The timeout is specified when creating the stan connection with the `PubAckWait(t time.Duration)` option.
If you don't want to block, you can use the async version, `PublishAsync()`, but then you should set up an ack handler if you want to ensure that messages are persisted OK.
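The non-blocking variant described above might look roughly like this. This is a sketch, not the project's code: the cluster ID, client ID, and subject are placeholders, and it assumes the `stan.go` client and a reachable NATS Streaming server.

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Hypothetical IDs; a NATS Streaming server must be reachable.
	sc, err := stan.Connect("faas-cluster", "faas-publisher",
		stan.PubAckWait(5*time.Second)) // how long publishes wait for a server ack
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// The ack handler is how an async publisher learns whether the
	// message was actually persisted by the server.
	ackHandler := func(guid string, err error) {
		if err != nil {
			log.Printf("message %s was not persisted: %v", guid, err)
			return
		}
		log.Printf("message %s persisted", guid)
	}

	// PublishAsync returns immediately; the handler fires on the ack.
	if _, err := sc.PublishAsync("faas-request", []byte("payload"), ackHandler); err != nil {
		log.Println(err)
	}
}
```

The trade-off is that the caller no longer knows, at the point of the call, whether the message made it; persistence failures surface in the ack handler instead.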
I would recommend having a look at the new client connection lost handler, though this requires running with server 0.10.2.
Regardless, if you keep the reconnect logic here, I believe some changes are required.
```go
		return nil, err
	}

	stanConn, err := stan.Connect(clusterID, clientID, stan.NatsConn(natsConn))
```
In the current go-nats-streaming release, we have added a handler to be notified when the connection is lost. It is a bit too long to explain here, so have a look at this.
Note that your current reconnect handling may still be needed, since once the connection is gone you need to recreate the connection and subscriptions, but maybe you should rely on the stan connection-lost handler instead. I would recommend not passing the NATS connection (let stan create and own it) unless you need specific settings. Note that you can pass the URL to stan with the `stan.NatsURL()` option.
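Wiring up the connection-lost handler and passing the URL (rather than a pre-built NATS connection) might look like the sketch below. The cluster ID, client ID, and URL are placeholders, and this assumes a `stan.go` client recent enough to have `SetConnectionLostHandler`.

```go
package main

import (
	"log"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Let stan create and own the underlying NATS connection by passing
	// the URL directly instead of a pre-built *nats.Conn.
	sc, err := stan.Connect("faas-cluster", "faas-publisher",
		stan.NatsURL("nats://nats:4222"), // hypothetical address
		stan.SetConnectionLostHandler(func(_ stan.Conn, reason error) {
			// Fired when the client gives up on the streaming connection;
			// recreate the connection (and any subscriptions) from here.
			log.Printf("connection lost: %v", reason)
		}))
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()
}
```

The handler fires only once the client itself decides the connection is gone, so it complements rather than replaces the low-level NATS reconnect callbacks.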
```go
	return func(c *nats.Conn) {
		oldConn := q.stanConn

		defer oldConn.Close()
```
With the way you handle reconnect (at the NATS low level), it is possible that the Streaming server still has the connection registered, because it did not detect the missed heartbeats (HBs) from the client. Remember, the streaming server connects to NATS and clients connect to NATS, so there is no direct connection between client and server.
So this is pretty broken, actually: it is possible that your Connect() below will fail due to a duplicate (the same clientID will be detected as a duplicate by the Streaming server, which will contact the old connection on its INBOX, and that will be valid since it has reconnected to NATS), and yet you will still close the old connection.
So with this current model, I would first close the old connection and then create the new one.
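The ordering suggested above could be sketched as follows. This is an illustrative fragment in the shape of the reviewed handler, not the PR's actual code; `clusterID` and `clientID` are assumed to be in scope as in the surrounding file.

```go
// Close the old streaming connection first, then create the new one, so the
// Streaming server deregisters the old clientID before the fresh Connect()
// and does not reject it as a duplicate.
return func(c *nats.Conn) {
	if q.stanConn != nil {
		q.stanConn.Close() // deregister the old clientID first
	}

	stanConn, err := stan.Connect(clusterID, clientID, stan.NatsConn(c))
	if err != nil {
		log.Printf("re-connect failed: %v", err)
		return
	}
	q.stanConn = stanConn
}
```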
```diff
@@ -50,7 +64,26 @@ func (q *NatsQueue) Queue(req *queue.Request) error {
 		log.Println(err)
 	}
 
-	err = q.nc.Publish(q.Topic, out)
+	err = q.stanConn.Publish(q.Topic, out)
```
Note that the reconnect handler is async, so if you replace the connection there, and access it here, you need some locking to avoid races.
Thanks for the feedback. If we moved to the 0.12.0 version, how would you feel about spending half an hour or so writing the patch to fix up the re-connect logic?
@alexellis I did not have time to have a look, but I see that you are now using the 0.4.0 client lib, which has the connection lost handler. If you upgrade to the 0.11.2 server, then you could make use of that. I can try to submit a PR when I get the chance. However, I am not sure how I would test that, so I may ask you to have a go once you have the PR.
@alexellis Is there anything I could do to help fix this issue? Currently we're facing the same issue, and the only way to solve it is to restart the queue-worker and the faas gateway. Thanks
Derek close: implemented via other PRs
🎉🎉🎉
Thanks @vosmith, this has been in both the queue worker and gateway for several releases now. I wanted to close the old PR.
Signed-off-by: Alex Ellis (VMware) alexellis2@gmail.com
Description
The publisher will now re-connect when NATS Streaming becomes
unavailable. Tested within the gateway code on Docker Swarm.
Thanks to @vosmith for initiating this work.
Motivation and Context
#17
How Has This Been Tested?
In the gateway, by scaling NATS Streaming to zero replicas and back again.
Types of changes
@vosmith @kozlovic @stefanprodan how is this looking?
I want to move the work started by @vosmith forward gradually. I realize that an in-memory restart is not ideal, so we're also looking to provide a configuration with some form of "ft" or PV.