This repository has been archived by the owner on Nov 19, 2021. It is now read-only.

GTM healthcheck issue #29

Closed
sstubbs opened this issue Jul 30, 2019 · 13 comments
@sstubbs
Contributor

sstubbs commented Jul 30, 2019

I seem to be getting this error on the GTM.
Expecting a startup message, but received �

I wonder if it's related to this.
#15

Both inserting and querying work on both coordinators, though.

I will try and create another cluster and see if this issue is still there.

@tiredpixel
Owner

Yeah, this is still an issue. The healthcheck was improved, but it still doesn't send a GTM-compatible payload. Everything should work fine, though. A patch to improve the healthcheck to not have this error would be gratefully received, if you can find a simple, one-line way to do it without changing lots of things (otherwise, I'd just leave it as-is, even though it's not ideal).

@sstubbs
Contributor Author

sstubbs commented Jul 31, 2019

OK, if I come up with anything I will let you know. It's a minor issue and doesn't affect anything that's running. Just thought I would let you know.

@tiredpixel
Owner

Okay, thanks. Yes, it shouldn't affect anything. But I would love to resolve this properly. Perhaps we can see which initial bytes are sent over the wire to advertise a valid payload, and send those without any further data? i.e. by simulating whatever the Coordinators and Datanodes send in their hello-type message.

@sstubbs
Contributor Author

sstubbs commented Aug 8, 2019

OK I will have a look at this.

@sstubbs
Contributor Author

sstubbs commented Aug 14, 2019

I've tried nc -lkv "${PG_HOST}" 6666 and other host options, but I'm not getting any output. Do you have any ideas what I should try? I've tried Wireshark, but that needs a GUI. I've looked at nmap, but from my understanding it uses netcat anyway. I really would like to get this fixed, but I'm not really a networking expert. Ideally I would like to post something on one of 2ndQuadrant's lists, but I haven't had responses in the past, so I'm not sure if I'm asking questions in the right place.

@tiredpixel
Owner

Haven't had chance to look into this much, yet, but I think maybe it's contrib/pgxc_monitor/pgxc_monitor.c in the Postgres-XL source, with

switch(nodetype)
	{
		case GTM:
			exit(do_gtm_ping(host, port, nodetype, nodename, verbose));
/*
 * Ping a given GTM or GTM-proxy
 */
static int
do_gtm_ping(char *host, char* port, nodetype_t nodetype, char *nodename, bool verbose)
{

So, I suppose it's the gtm/gtm_client.h include or, following further, the GTMPQconnectPoll() function in src/gtm/client/fe-connect.c, and specifically:

		case CONNECTION_MADE:
			{
				GTM_StartupPacket *sp = (GTM_StartupPacket *)
					malloc(sizeof(GTM_StartupPacket));
				int packetlen = sizeof(GTM_StartupPacket);

				MemSet(sp, 0, sizeof(GTM_StartupPacket));

				/*
				 * Build a startup packet. We tell the GTM server/proxy our
				 * PGXC Node name and whether we are a proxy or not.
				 *
				 * When the connection is made from the proxy, we let the GTM
				 * server know about it so that some special headers are
				 * handled correctly by the server.
				 */
				strncpy(sp->sp_node_name, conn->gc_node_name, SP_NODE_NAME);
				sp->sp_remotetype = conn->remote_type;
				sp->sp_ispostmaster = conn->is_postmaster;
				sp->sp_client_id = conn->my_id;

				/*
				 * Send the startup packet.
				 *
				 * Theoretically, this could block, but it really shouldn't
				 * since we only got here if the socket is write-ready.
				 */
				if (pqPacketSend(conn, 'A', (char *)sp, packetlen) != STATUS_OK)

So, I guess it's possible to get the info by following that struct, or perhaps seeing if there's a test case somewhere that calls and checks it. Alternatively (and possibly easier), it might be possible to set up something to log incoming traffic, but that would assume a single-stage handshake, which might well not be the case. Or indeed to sniff the traffic as you were looking at.
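The "log incoming traffic" idea could be sketched in Python roughly as follows. This is only an illustration: the function names `capture_first_message` and `hexdump` are made up for this example, a Coordinator or Datanode would have to be pointed at the listener instead of the real GTM, and, as noted above, it only shows the first message of what may be a multi-stage handshake.

```python
import socket


def capture_first_message(listener: socket.socket,
                          max_bytes: int = 4096,
                          timeout: float = 5.0) -> bytes:
    """Accept one connection and return the first chunk the client sends.

    `listener` must already be bound and listening. A single recv() may
    return less than the full message; that is enough to see the header.
    """
    listener.settimeout(timeout)
    conn, _peer = listener.accept()
    with conn:
        conn.settimeout(timeout)
        return conn.recv(max_bytes)


def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex column and printable-ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:04x}  {hex_part:<{width * 3}} {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical usage: listen where a node expects to find the GTM.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", 6666))
        srv.listen(1)
        print(hexdump(capture_first_message(srv, timeout=60.0)))
```

Unlike `nc -l`, this prints unprintable bytes in hex, which matters here because the startup packet is mostly null bytes.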

Or, I suppose there's the option to use pgxc_monitor directly—but this seems very heavy, to me, especially as the images no longer contain pgxc_ctl.

I'll try to circle back round to this at some point, when I get a bit more time. :)

tiredpixel added a commit that referenced this issue Sep 4, 2019
Previously, although the healthcheck succeeded and everything seemed to
work, the GTM logged error

    Expecting a startup message, but received �

Fix by reverse-engineering the minimal startup packet for the GTM, using
tcpdump from the nicolaka/netshoot image, with a command like

    docker run -it --rm --net container:e0f3eec77071 nicolaka/netshoot \
        tcpdump -X -i lo
@tiredpixel
Owner

Using the nicolaka/netshoot image to provide tcpdump, together with src/gtm/main/main.c in the Postgres-XL source for more context, a valid minimal startup packet sent by a Datanode called data_1 is:

14:53:37.896875 IP postgres-xl-docker_db_data_1_1_34fd20451314.postgres-xl-docker_db_a.42666 > e0f3eec77071.6666: Flags [P.], seq 1:82, ack 1, win 229, options [nop,nop,TS val 10782753 ecr 10782753], length 81
	0x0000:  4500 0085 5e7a 4000 4006 baa2 c0a8 5003  E...^z@.@.....P.
	0x0010:  c0a8 5002 a6aa 1a0a 44aa d65a 9aca d1bb  ..P.....D..Z....
	0x0020:  8018 00e5 21ce 0000 0101 080a 00a4 8821  ....!..........!
	0x0030:  00a4 8821 4100 0000 5064 6174 615f 3100  ...!A...Pdata_1.
	0x0040:  0000 0000 0000 0000 0000 0000 0000 0000  ................
	0x0050:  0000 0000 0000 0000 0000 0000 0000 0000  ................
	0x0060:  0000 0000 0000 0000 0000 0000 0000 0000  ................
	0x0070:  0000 0000 0000 0000 0006 0000 0000 0000  ................
	0x0080:  0000 0000 00 
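To make the capture easier to read, here is a small Python sketch that strips the IP and TCP headers from the dump above and splits what remains into fields. The field layout (a 64-byte node name followed by a 4-byte remote type in host byte order) is an assumption inferred from this one capture and from the GTM_StartupPacket code quoted earlier, not something taken from the struct definition itself.

```python
import struct

# Hex column of the tcpdump -X output above, one string per row.
CAPTURE_ROWS = [
    "4500 0085 5e7a 4000 4006 baa2 c0a8 5003",
    "c0a8 5002 a6aa 1a0a 44aa d65a 9aca d1bb",
    "8018 00e5 21ce 0000 0101 080a 00a4 8821",
    "00a4 8821 4100 0000 5064 6174 615f 3100",
    "0000 0000 0000 0000 0000 0000 0000 0000",
    "0000 0000 0000 0000 0000 0000 0000 0000",
    "0000 0000 0000 0000 0000 0000 0000 0000",
    "0000 0000 0000 0000 0006 0000 0000 0000",
    "0000 0000 00",
]

NODE_NAME_FIELD = 64  # assumed width of sp_node_name, inferred from the dump


def tcp_payload(rows):
    """Return the TCP payload of a raw IPv4 packet given as hex rows."""
    packet = bytes.fromhex("".join(rows).replace(" ", ""))
    ip_header_len = (packet[0] & 0x0F) * 4                   # IHL, 32-bit words
    tcp_header_len = (packet[ip_header_len + 12] >> 4) * 4   # TCP data offset
    return packet[ip_header_len + tcp_header_len:]


def decode_startup(payload):
    """Split the payload into (type, length, node name, remote type)."""
    msg_type = payload[0:1]
    (length,) = struct.unpack("!I", payload[1:5])  # big-endian, counts itself
    body = payload[5:]
    name = body[:NODE_NAME_FIELD].split(b"\x00", 1)[0].decode("ascii")
    # Assumption: the integer after the name is in little-endian host order.
    (remote_type,) = struct.unpack(
        "<i", body[NODE_NAME_FIELD:NODE_NAME_FIELD + 4])
    return msg_type, length, name, remote_type


if __name__ == "__main__":
    payload = tcp_payload(CAPTURE_ROWS)
    print(len(payload), decode_startup(payload))
```

Read this way, the 81 bytes are the type byte 'A', a 4-byte length of 80 that counts itself, and a 76-byte body, which matches the `pqPacketSend(conn, 'A', ...)` call in the source excerpt.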

Stripping the header and null-padding appropriately to not cause GTM errors (such as OOM), a valid connection is:

echo -n -e "\x41\x00\x00\x00\x50\x64\x61\x74\x61\x5f\x31\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" | nc -w 1 "${PG_HOST}" "${PG_PORT}"

Replacing name data_1 with _healthcheck (it doesn't seem to need to be a valid node) and calculating the padding appropriately, as well as changing echo -n -e to printf "%b" to be more portable, yields:

printf "%b" "\x41\x00\x00\x00\x50\x5f\x68\x65\x61\x6c\x74\x68\x63\x68\x65\x63\x6b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" | nc -w 1 "${PG_HOST}" "${PG_PORT}"
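Rather than counting null bytes by hand, the same packet can be generated for any node name. The Python sketch below is one way to do that, assuming the layout inferred from the capture (type byte 'A', big-endian length that includes itself, 76-byte body with the name at the start and everything else zeroed, as in the commands above). The names `build_startup_packet` and `send_probe` are illustrative only.

```python
import socket
import struct

NODE_NAME_FIELD = 64  # assumed width of the node name field
BODY_LEN = 76         # body size observed in the capture (81 - 1 - 4)


def build_startup_packet(node_name: str) -> bytes:
    """Build a minimal GTM startup packet with all non-name fields zeroed."""
    name = node_name.encode("ascii")
    if len(name) >= NODE_NAME_FIELD:
        raise ValueError("node name must leave room for a terminating null")
    body = name.ljust(BODY_LEN, b"\x00")
    # The length word is big-endian and counts its own 4 bytes.
    return b"A" + struct.pack("!I", len(body) + 4) + body


def send_probe(host: str, port: int, node_name: str = "_healthcheck",
               timeout: float = 1.0) -> None:
    """Connect, send the startup packet and close, like `nc -w 1`."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_startup_packet(node_name))


if __name__ == "__main__":
    print(build_startup_packet("_healthcheck").hex())
```

A successful `send_probe` only shows that the port accepted the connection and the bytes; it does not read or validate any reply from the GTM.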

Please see #33 for a working implementation.

Please could you kindly try running it locally (I haven't built it into an image), and see if it solves the problem for you? Thanks.

@JuliuszJ

JuliuszJ commented Sep 7, 2019

Hi,
Is pgxc_monitor a good alternative for health-checking the GTM?
Thank you
Juliusz

@tiredpixel
Owner

tiredpixel commented Sep 7, 2019

Hi @JuliuszJ. I think that's in the same contrib set as pgxc_ctl, right? Which I dropped from the image a while back, along with all the dependencies [1858d36]. I did actually consider this when looking through the source code, but I'm not sure whether adding another program just for this is worth it (although I haven't absolutely decided against it, either). However, wouldn't it require parsing the output anyway? And more importantly, does it not require SSH and all the setup that pgxc_ctl does?

@JuliuszJ

JuliuszJ commented Sep 8, 2019

Thank you @tiredpixel for the quick response.

Hi @JuliuszJ . I think that's in the same contrib set as pgxc_ctl, right?

It seems that the PG-XL team moved pgxc_ctl from contrib to src/bin. pgxc_monitor was left as a separate contrib module.

Which I dropped from the image a while back, along with all the dependencies [1858d36]. I did actually consider this when looking through the sourcecode, but I'm not sure whether adding another program just for this is worth it (although I haven't absolutely decided against it, either).

I am asking because magic scares me ;)

However, wouldn't it require parsing the output anyway?

The doc says: "If the target node is running, it exits with exit code zero. If not, it exits with a non-zero exit code. "

And more importantly, does it not require SSH and all the setup that pgxc_ctl does?

No SSH, no setup, simple command line.

Thank you
Juliusz
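Because pgxc_monitor reports health through its exit code, a check built on it needs no output parsing. The Python sketch below illustrates that; the `-Z gtm`, `-h` and `-p` options are given as I understand them from the pgxc_monitor documentation and should be verified against the installed Postgres-XL version, and the function names are hypothetical.

```python
import subprocess
from typing import Callable, List


def gtm_healthcheck_argv(host: str, port: int) -> List[str]:
    """Command line for pinging a GTM (flags assumed; check your version)."""
    return ["pgxc_monitor", "-Z", "gtm", "-h", host, "-p", str(port)]


def gtm_is_healthy(host: str, port: int,
                   run: Callable[[List[str]], int] = subprocess.call) -> bool:
    """Return True if pgxc_monitor exits with code zero.

    `run` is injectable so the logic can be exercised without the binary.
    """
    try:
        return run(gtm_healthcheck_argv(host, port)) == 0
    except OSError:
        # pgxc_monitor is not installed or not executable.
        return False


if __name__ == "__main__":
    raise SystemExit(0 if gtm_is_healthy("localhost", 6666) else 1)
```

In a container healthcheck the same effect is achieved by running the command directly and letting its exit status decide the result.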

@tiredpixel
Owner

@JuliuszJ, interesting, thanks; I didn't realise that. Let me take another look at it; I, too, am dubious of needless magic—but equally, I don't want to introduce some whole new piece because of it. But if it's as you say, it might well be suitable. I'll try to find some time in a bit, and run some tests.

tiredpixel added a commit that referenced this issue Sep 13, 2019
Previously, although the healthcheck succeeded and everything seemed to
work, the GTM logged error

    Expecting a startup message, but received �

Fix by replacing netcat with pgxc_monitor to check GTM health.

Many thanks to @sstubbs for motivating me to fix this, and to @JuliuszJ
for the suggestion to use pgxc_monitor instead of magic.
@tiredpixel
Owner

That's much better; thank you, @JuliuszJ! I didn't realise it would be so easy. I've replaced my magic with pgxc_monitor; it seems to work fine.

@tiredpixel tiredpixel self-assigned this Sep 13, 2019
@tiredpixel
Owner

Seems fine to me. This will be included in the next release.
