
Investigate: NATS Streaming crash/lock-up #1031

Closed

alexellis opened this issue Jan 14, 2019 · 7 comments

@alexellis
Member

alexellis commented Jan 14, 2019

Expected Behaviour

The connection to the NATS Streaming Server in the gateway should stay up and available to serve asynchronous requests.

Current Behaviour

@padiazg observed with Swarm on two occasions that NATS Streaming appeared to stop accepting new asynchronous requests. I have also noticed this with Kubernetes in OpenFaaS Cloud Community Cluster on one occasion.

Possible Solution

Tasks:

  • Run NATS Streaming in HA mode or with persistence so that it can be restarted without losing data if an issue is detected. (Document how to deploy with HA NATS deployments docs#101)
  • Document the three sets of HA NATS instructions sent over by the NATS team on the docs site via Document how to deploy with HA NATS deployments docs#101 - in addition, we need to give users clear instructions for both Kubernetes and Swarm
  • Investigate whether the gateway can reconnect if the NATS Streaming TCP connection is severed (this may need to be simulated by patching the gateway code) - this may be related to whether NATS is running with an in-memory store or with persistence
  • Add re-connect to the gateway handler.go code (publisher) - see the sketch below this list
  • Add re-connect handler to the nats-queue-worker code (subscriber)
  • Evaluate the current set of NATS Streaming Prometheus exporters and whether their metrics can be used to create alerts in AlertManager for HipChat/PagerDuty etc.

I don't believe you can restart a connection / subscription if the NATS Streaming Server is running in in-memory mode. See also: openfaas/nats-queue-worker#33
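
A minimal sketch of the publisher-side re-connect referenced in the task list above, assuming the Go NATS Streaming client; the natsQueue type, ping interval, retry delay and names below are illustrative rather than the actual gateway code:

package main

import (
	"log"
	"sync"
	"time"

	stan "github.com/nats-io/go-nats-streaming" // the client now lives at github.com/nats-io/stan.go
)

// natsQueue wraps a NATS Streaming connection and re-establishes it when
// client-side pings detect that the server has gone away.
type natsQueue struct {
	sync.RWMutex
	conn      stan.Conn
	clusterID string
	clientID  string
	natsURL   string
}

func (q *natsQueue) connect() {
	for {
		conn, err := stan.Connect(q.clusterID, q.clientID,
			stan.NatsURL(q.natsURL),
			// Ping every 5 seconds; after 3 missed pings the connection
			// is declared lost and the handler below fires.
			stan.Pings(5, 3),
			stan.SetConnectionLostHandler(func(_ stan.Conn, reason error) {
				log.Printf("NATS Streaming connection lost: %v - reconnecting", reason)
				q.connect()
			}),
		)
		if err != nil {
			log.Printf("connect to %s failed: %v - retrying in 2s", q.natsURL, err)
			time.Sleep(2 * time.Second)
			continue
		}

		q.Lock()
		q.conn = conn
		q.Unlock()
		return
	}
}

// Queue publishes an async invocation in the same way the gateway's
// handler.go does once a connection is available.
func (q *natsQueue) Queue(subject string, body []byte) error {
	q.RLock()
	defer q.RUnlock()
	return q.conn.Publish(subject, body)
}

func main() {
	q := &natsQueue{clusterID: "faas-cluster", clientID: "gateway-publisher", natsURL: "nats://nats:4222"}
	q.connect()

	if err := q.Queue("faas-request", []byte(`{"body": "test"}`)); err != nil {
		log.Println(err)
	}
}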

Steps to Reproduce (for bugs)

  1. Unclear

Context

If this crashes then manual action is required and it is currently not easy to know whether it has crashed from a dashboard/alert. This could affect people relying on NATS Streaming in production like @padiazg / Vision.

The configuration of NATS Streaming is "memory" by default:
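
For reference, the Swarm stack runs the nats service with the in-memory store, roughly like this (the image tag here is from memory and may not match the current docker-compose.yml exactly):

    nats:
        image: nats-streaming:0.11.2
        command: "--store memory --cluster_id faas-cluster"
        networks:
            - functions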

Patricio has done some experimentation with MySQL as a backing store

Your Environment

  • FaaS-CLI version ( Full output from: faas-cli version ):

  • Docker version docker version (e.g. Docker 17.0.05 ):

  • Are you using Docker Swarm or Kubernetes (FaaS-netes)?

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • Link to your project or a code example to reproduce issue:

  • Please also follow the troubleshooting guide and paste in any other diagnostic information you have:

@padiazg
Contributor

padiazg commented Jan 14, 2019

The reason NATS crashes is still unknown to me, but the way to reproduce it for testing purposes is to kill NATS with this:

$ docker kill $(docker ps -qf "name=func_nats")

After that you will get the same results as when it crashes "in the wild".

Then you can get the stack back to work with this:

$ docker kill $(docker ps -qf "name=func_nats") $(docker ps -qf "name=func_queue-worker") $(docker ps -qf "name=func_gateway")

@bartsmykla

I have had some success. I'm testing the reconnecting logic right now, and if everything goes well, I should have a PR with a solution in around 2 hours.

@bartsmykla

Summary: what I have done so far:

Add re-connect to the gateway handler.go code (publisher)

Add re-connect handler to the nats-queue-worker code (subscriber)
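
For illustration only, the subscriber side comes down to restoring a durable queue subscription once the connection is back, along these lines (the subject and queue-group names are assumptions and may differ from what the PR uses):

package worker

import (
	"time"

	stan "github.com/nats-io/go-nats-streaming" // now github.com/nats-io/stan.go
)

// resubscribe restores the queue-worker's durable queue subscription after a
// re-connect; with a durable name, messages published while the worker was
// away are redelivered once the subscription comes back.
func resubscribe(conn stan.Conn, handle stan.MsgHandler) (stan.Subscription, error) {
	return conn.QueueSubscribe("faas-request", "faas", handle,
		stan.DurableName("faas"),
		stan.SetManualAckMode(),
		stan.AckWait(30*time.Second),
	)
}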

I also did some research into Prometheus exporters for NATS Streaming. Until the official NATS exporter supports NATS Streaming metrics (there is an open PR for that: nats-io/prometheus-nats-exporter#54), our best (and actually only) option is this exporter: https://gitlab.com/civist/nats-streaming-exporter
I couldn't find anything else, so I tested whether it actually exports the metrics. I built and pushed my own Docker image and deployed it to my local Swarm cluster of OpenFaaS by adding the following to docker-compose.yml:

    nats-streaming-prometheus-exporter:
        image: bartsmykla/nats-streaming-exporter:0.0.1
        networks:
            - functions
        command: "/nats-streaming-exporter -nats-uri http://nats:8222"
        deploy:
            resources:
                limits:
                    memory: 125M
                reservations:
                    memory: 50M
            placement:
                constraints:
                    - 'node.platform.os == linux'
        ports:
            - 9275:9275

so if you want to test that exporter too, feel free to use my image: bartsmykla/nats-streaming-exporter:0.0.1

@alexellis
Member Author

Thank you for the update Bart.

Which metrics could @padiazg make use of to alert on or observe the health of his NATS Streaming instance/cluster?

@bartsmykla

I think: natsstreaming_up, natsstreaming_exporter_json_parse_failures, natsstreaming_subscriptions_pending. Here is the full list from the exporter:

&Exporter{
		URI:     u,
		Timeout: timeout,
		up: prometheus.NewGauge(prometheus.GaugeOpts{
			Namespace: namespace,
			Name:      "up",
			Help:      "Was the last scrape of nats-streaming successful.",
		}),
		totalScrapes: prometheus.NewCounter(prometheus.CounterOpts{
			Namespace: namespace,
			Name:      "exporter_total_scrapes",
			Help:      "Current total nats-streaming scrapes.",
		}),
		jsonParseFailures: prometheus.NewCounter(prometheus.CounterOpts{
			Namespace: namespace,
			Name:      "exporter_json_parse_failures",
			Help:      "Number of errors while parsing JSON.",
		}),
		clientsTotal:              newDesc("clients", "Number of currently connected clients.", nil),
		channelsTotal:             newDesc("channels", "Current number of channels.", nil),
		storeMessagesTotal:        newDesc("store_messages", "Current number of messages in the store.", nil),
		storeMessagesBytes:        newDesc("store_messages_bytes", "Total size of the messages in the store.", nil),
		subscriptionsTotal:        newDesc("subscriptions", "Number of subscriptions.", []string{"channel", "client"}),
		subscriptionsPendingTotal: newDesc("subscriptions_pending", "Number of pending messages.", []string{"channel", "client"}),
		subscriptionsStalledTotal: newDesc("subscriptions_stalled", "Number of stalled subscriptions.", []string{"channel", "client"}),
		messagesTotal:             newDesc("messages", "Number of messages.", []string{"channel"}),
		messagesBytes:             newDesc("messages_bytes", "Size of the messages.", []string{"channel"}),
	}
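
For example, a couple of Prometheus alerting rules along these lines could feed AlertManager; the thresholds and rule names here are only suggestions:

groups:
  - name: nats-streaming
    rules:
      - alert: NATSStreamingDown
        expr: natsstreaming_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "NATS Streaming is unreachable - the exporter's last scrape failed"
      - alert: NATSStreamingBacklog
        expr: sum(natsstreaming_subscriptions_pending) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Messages are backing up on NATS Streaming subscriptions"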

@alexellis
Member Author

Resolved through patches to the gateway and the queue-worker.
