GTM healthcheck issue #29

sstubbs · 2019-07-30T14:38:28Z

I seem to be getting this error on the GTM.
Expecting a startup message, but received �

I wonder if it's related to this.
#15

Both inserting into and querying both coordinators works, though.

I will try and create another cluster and see if this issue is still there.

tiredpixel · 2019-07-31T10:30:49Z

Yeah, this is still an issue. The healthcheck was improved, but it still doesn't send a GTM-compatible payload. Everything should work fine, though. A patch to improve the healthcheck to not have this error would be gratefully received, if you can find a simple, one-line way to do it without changing lots of things (otherwise, I'd just leave it as-is, even though it's not ideal).

sstubbs · 2019-07-31T10:47:36Z

OK, if I come up with anything I will let you know. It's a minor issue and doesn't affect anything that's running. Just thought I would let you know.

tiredpixel · 2019-08-08T08:10:28Z

Okay, thanks. Yes, it shouldn't affect anything. But I would love to resolve this properly. Perhaps we can see which initial bytes are sent over the wire to constitute a valid payload, and send just those without any further data? That is, by simulating whatever the Coordinators and Datanodes send in their hello-type message.

sstubbs · 2019-08-08T08:18:06Z

OK I will have a look at this.

sstubbs · 2019-08-14T20:49:01Z

I've tried nc -lkv "${PG_HOST}" 6666 and other host options. Not getting any output though. Do you have any ideas what I should try? I've tried Wireshark, but that needs a GUI. I've looked at nmap, but from my understanding it uses netcat anyway. I would really like to get this fixed, but I'm not really a networking expert. Ideally I would like to post something on one of 2ndQuadrant's lists, but I haven't had responses in the past, so I'm not sure if I'm asking questions in the right place.

tiredpixel · 2019-08-29T06:52:21Z

Haven't had a chance to look into this much yet, but I think maybe it's contrib/pgxc_monitor/pgxc_monitor.c in the Postgres-XL source, with

switch(nodetype)
	{
		case GTM:
			exit(do_gtm_ping(host, port, nodetype, nodename, verbose));

/*
 * Ping a given GTM or GTM-proxy
 */
static int
do_gtm_ping(char *host, char* port, nodetype_t nodetype, char *nodename, bool verbose)
{

So, I suppose it's the gtm/gtm_client.h include or, following further, the GTMPQconnectPoll() function in src/gtm/client/fe-connect.c, and specifically:

		case CONNECTION_MADE:
			{
				GTM_StartupPacket *sp = (GTM_StartupPacket *)
					malloc(sizeof(GTM_StartupPacket));
				int packetlen = sizeof(GTM_StartupPacket);

				MemSet(sp, 0, sizeof(GTM_StartupPacket));

				/*
				 * Build a startup packet. We tell the GTM server/proxy our
				 * PGXC Node name and whether we are a proxy or not.
				 *
				 * When the connection is made from the proxy, we let the GTM
				 * server know about it so that some special headers are
				 * handled correctly by the server.
				 */
				strncpy(sp->sp_node_name, conn->gc_node_name, SP_NODE_NAME);
				sp->sp_remotetype = conn->remote_type;
				sp->sp_ispostmaster = conn->is_postmaster;
				sp->sp_client_id = conn->my_id;

				/*
				 * Send the startup packet.
				 *
				 * Theoretically, this could block, but it really shouldn't
				 * since we only got here if the socket is write-ready.
				 */
				if (pqPacketSend(conn, 'A', (char *)sp, packetlen) != STATUS_OK)

So, I guess it's possible to get the info by following that struct, or perhaps seeing if there's a test case somewhere that calls and checks it. Alternatively (and possibly easier), it might be possible to set up something to log incoming traffic, but that would assume a single-stage handshake, which might well not be the case. Or indeed to sniff the traffic as you were looking at.

Or, I suppose there's the option to use pgxc_monitor directly—but this seems very heavy, to me, especially as the images no longer contain pgxc_ctl.

I'll try to circle back round to this at some point, when I get a bit more time. :)

Previously, although the healthcheck succeeded and everything seemed to work, the GTM logged the error: Expecting a startup message, but received �

Fix by reverse-engineering the minimal startup packet for the GTM, using tcpdump from the nicolaka/netshoot image, with a command like:

docker run -it --rm --net container:e0f3eec77071 nicolaka/netshoot \
    tcpdump -X -i lo

tiredpixel · 2019-09-04T16:38:09Z

Using the nicolaka/netshoot image to provide tcpdump, in combination with the Postgres-XL source src/gtm/main/main.c to provide some more context, a valid minimal startup packet as sent by a Datanode called data_1 is:

14:53:37.896875 IP postgres-xl-docker_db_data_1_1_34fd20451314.postgres-xl-docker_db_a.42666 > e0f3eec77071.6666: Flags [P.], seq 1:82, ack 1, win 229, options [nop,nop,TS val 10782753 ecr 10782753], length 81
	0x0000:  4500 0085 5e7a 4000 4006 baa2 c0a8 5003  E...^z@.@.....P.
	0x0010:  c0a8 5002 a6aa 1a0a 44aa d65a 9aca d1bb  ..P.....D..Z....
	0x0020:  8018 00e5 21ce 0000 0101 080a 00a4 8821  ....!..........!
	0x0030:  00a4 8821 4100 0000 5064 6174 615f 3100  ...!A...Pdata_1.
	0x0040:  0000 0000 0000 0000 0000 0000 0000 0000  ................
	0x0050:  0000 0000 0000 0000 0000 0000 0000 0000  ................
	0x0060:  0000 0000 0000 0000 0000 0000 0000 0000  ................
	0x0070:  0000 0000 0000 0000 0006 0000 0000 0000  ................
	0x0080:  0000 0000 00
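
As a cross-check on where the payload starts in the capture above, the header lengths can be read from the dump itself. This is only a sketch in shell arithmetic, assuming the standard IPv4 and TCP header layouts; the variable names are illustrative and not taken from any tool.

```shell
# Values read from the hex dump above (standard IPv4/TCP layout assumed).
ip_first_byte=0x45    # offset 0x00: version 4, header length 5 words
ip_total_len=0x0085   # offset 0x02: total length of the IP packet
tcp_off_byte=0x80     # offset 0x20: TCP data offset in the high nibble

ip_hdr=$(( (ip_first_byte & 0x0f) * 4 ))      # IP header length in bytes
tcp_hdr=$(( (tcp_off_byte >> 4) * 4 ))        # TCP header length in bytes
payload_start=$(( ip_hdr + tcp_hdr ))         # offset of the first payload byte
payload_len=$(( ip_total_len - payload_start ))

printf 'payload starts at 0x%x, length %d\n' "$payload_start" "$payload_len"
```

The result agrees with the "length 81" that tcpdump reports, and offset 0x34 is where the 0x41 ('A') byte appears in the dump.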

Stripping the header and null-padding appropriately to not cause GTM errors (such as OOM), a valid connection is:

echo -n -e "\x41\x00\x00\x00\x50\x64\x61\x74\x61\x5f\x31\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" | nc -w 1 "${PG_HOST}" "${PG_PORT}"

Replacing name data_1 with _healthcheck (it doesn't seem to need to be a valid node) and calculating the padding appropriately, as well as changing echo -n -e to printf "%b" to be more portable, yields:

printf "%b" "\x41\x00\x00\x00\x50\x5f\x68\x65\x61\x6c\x74\x68\x63\x68\x65\x63\x6b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" | nc -w 1 "${PG_HOST}" "${PG_PORT}"
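
For anyone who would rather not count the zero bytes by hand, the same packet can probably be generated. The following is a rough sketch, not what the image ships: gtm_startup_packet is a hypothetical helper name, and the only facts it relies on are taken from the capture above (a 1-byte 'A' type, a 4-byte length of 0x50, and 81 bytes in total, which leaves 76 bytes for the name plus padding). It uses octal escapes because POSIX printf does not specify \x hex escapes, and it assumes an ASCII node name.

```shell
# Hypothetical helper: emit an 81-byte GTM startup packet for a node name.
gtm_startup_packet() {
  name="$1"
  # 81 bytes total = 1 (type) + 4 (length) + 76 (name and NUL padding).
  pad=$(( 76 - ${#name} ))
  # Require at least one NUL after the name; assumes an ASCII name.
  [ "$pad" -ge 1 ] || return 1
  # 'A' is the packet type; \000\000\000P is the length 0x00000050.
  printf 'A\000\000\000P%s' "$name"
  # Fill the rest of the payload with NUL bytes.
  head -c "$pad" /dev/zero
}
```

It would be used in the same way as the one-liner above: gtm_startup_packet _healthcheck | nc -w 1 "${PG_HOST}" "${PG_PORT}"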

Please see #33 for a working implementation.

Please could you kindly try running it locally (I haven't built it into an image), and see if it solves the problem for you? Thanks.

JuliuszJ · 2019-09-07T08:09:54Z

Hi,
Is pgxc_monitor a good alternative for health checking of the GTM?
Thank you
Juliusz

tiredpixel · 2019-09-07T10:08:06Z

Hi @JuliuszJ. I think that's in the same contrib set as pgxc_ctl, right? Which I dropped from the image a while back, along with all the dependencies [1858d36]. I did actually consider this when looking through the source code, but I'm not sure whether adding another program just for this is worth it (although I haven't absolutely decided against it, either). However, wouldn't it require parsing the output anyway? And more importantly, does it not require SSH and all the setup that pgxc_ctl does?

JuliuszJ · 2019-09-08T09:35:11Z

Thank you @tiredpixel for quick response.

> Hi @JuliuszJ. I think that's in the same contrib set as pgxc_ctl, right?

It seems that the PG-XL team moved pgxc_ctl from contrib to src/bin. pgxc_monitor was left as a separate contrib.

> Which I dropped from the image a while back, along with all the dependencies [1858d36]. I did actually consider this when looking through the source code, but I'm not sure whether adding another program just for this is worth it (although I haven't absolutely decided against it, either).

I am asking because magic scares me ;)

> However, wouldn't it require parsing the output anyway?

The doc says: "If the target node is running, it exits with exit code zero. If not, it exits with a non-zero exit code."

> And more importantly, does it not require SSH and all the setup that pgxc_ctl does?

No SSH, no setup, simple command line.

Thank you
Juliusz

tiredpixel · 2019-09-12T08:41:31Z

@JuliuszJ, interesting, thanks; I didn't realise that. Let me take another look at it; I, too, am dubious of needless magic—but equally, I don't want to introduce some whole new piece because of it. But if it's as you say, it might well be suitable. I'll try to find some time in a bit, and run some tests.

This reverts commit 740084e.

@sstubbs

Previously, although the healthcheck succeeded and everything seemed to work, the GTM logged the error: Expecting a startup message, but received �

Fix by replacing netcat with pgxc_monitor to check GTM health. Many thanks to @sstubbs for motivating me to fix this, and to @JuliuszJ for the suggestion to use pgxc_monitor instead of magic.

tiredpixel · 2019-09-13T12:28:55Z

That's much better, thank you @JuliuszJ! I didn't realise it would be so easy. I've replaced my magic with pgxc_monitor; it seems to work fine.
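
For reference, a check along these lines might look roughly like the sketch below. It is not copied from the image: gtm_healthcheck is a hypothetical wrapper name, the -Z, -h and -p flags are my reading of the pgxc_monitor documentation and should be checked against your build, and the defaults (localhost, 6666) are only the values seen earlier in this thread.

```shell
# Hypothetical wrapper around pgxc_monitor for a container healthcheck.
# Assumed flags: -Z node type, -h host, -p port (verify with your build).
gtm_healthcheck() {
  host="${1:-localhost}"
  port="${2:-6666}"
  # Per the doc quoted above, pgxc_monitor exits zero when the target
  # node is running and non-zero otherwise, so no output parsing is needed.
  pgxc_monitor -Z gtm -h "$host" -p "$port"
}
```

Since the exit code is passed straight through, something like gtm_healthcheck "${PG_HOST}" "${PG_PORT}" could be used directly as the healthcheck command.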

tiredpixel · 2019-09-19T15:15:53Z

Seems fine to me. This will be included in the next release.

tiredpixel assigned sstubbs Aug 8, 2019

tiredpixel mentioned this issue Sep 4, 2019

improve GTM healthcheck #15

Closed

tiredpixel added a commit that referenced this issue Sep 13, 2019

Revert "[#29] build: improve GTM healthcheck to not throw error"

88333fb

This reverts commit 740084e.

tiredpixel self-assigned this Sep 13, 2019

tiredpixel closed this as completed Sep 19, 2019
