Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres Backend on Azure - Change Feed contantly crashing - ERROR: invalid hexadecimal digit: "-" (SQLSTATE 22023) #29911

Closed
anthonysomerset opened this issue Aug 2, 2023 · 13 comments · Fixed by #29965
Assignees
Labels

Comments

@anthonysomerset
Copy link
Contributor

Expected behavior:
unclear whats broken and should be expected

Current behavior:
unclear whats actually broken - general working of teleport seems to be OK - but getting constant error about change feed stopping in logs

Bug details:

  • Teleport version - 13.3.0
  • Recreation steps

Deploy Helm chart on AKS using azure config for postgres and storage backend
using postgres single server in azure

  • Debug logs
2023-08-02T15:01:33Z INFO [PGBK]      Change feed started. slot_name:da158aeccba74b6591580526fc587072 pgbk/background.go:170
2023-08-02T15:01:39Z WARN [AUTH:1:CA] Re-init the cache on error error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError watcher is closed: <nil>
Stack Trace:
	github.com/gravitational/teleport/lib/cache/cache.go:1158 github.com/gravitational/teleport/lib/cache.(*Cache).fetchAndWatch
	github.com/gravitational/teleport/lib/cache/cache.go:956 github.com/gravitational/teleport/lib/cache.(*Cache).update
	runtime/asm_amd64.s:1598 runtime.goexit
User Message: watcher is closed: <nil>] cache/cache.go:962
2023-08-02T15:01:39Z ERRO [PGBK]      Change feed stream lost. error:[
ERROR REPORT:
Original Error: *pgconn.PgError ERROR: invalid hexadecimal digit: "-" (SQLSTATE 22023)
Stack Trace:
	github.com/gravitational/teleport/lib/backend/pgbk/background.go:263 github.com/gravitational/teleport/lib/backend/pgbk.(*Backend).pollChangeFeed
	github.com/gravitational/teleport/lib/backend/pgbk/background.go:175 github.com/gravitational/teleport/lib/backend/pgbk.(*Backend).runChangeFeed
	github.com/gravitational/teleport/lib/backend/pgbk/background.go:101 github.com/gravitational/teleport/lib/backend/pgbk.(*Backend).backgroundChangeFeed
	github.com/gravitational/teleport/lib/backend/pgbk/pgbk.go:206 github.com/gravitational/teleport/lib/backend/pgbk.NewWithConfig.func2
	runtime/asm_amd64.s:1598 runtime.goexit
User Message: ERROR: invalid hexadecimal digit: "-" (SQLSTATE 22023)] pgbk/background.go:105
2023-08-02T15:01:39Z WARN [PROC:1]    Sync rotation state cycle failed. Retrying in ~10s pid:8.1 service/connect.go:826
2023-08-02T15:01:39Z WARN [KUBERNETE] Re-init the cache on error error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError watcher is closed: rpc error: code = Unknown desc = watcher closed
Stack Trace:
	github.com/gravitational/teleport/lib/cache/cache.go:1158 github.com/gravitational/teleport/lib/cache.(*Cache).fetchAndWatch
	github.com/gravitational/teleport/lib/cache/cache.go:956 github.com/gravitational/teleport/lib/cache.(*Cache).update
	runtime/asm_amd64.s:1598 runtime.goexit
User Message: watcher is closed: rpc error: code = Unknown desc = watcher closed] cache/cache.go:962
2023-08-02T15:01:39Z WARN             Restarting watch on error after waiting 389.200812ms. Error: watcher closed. local/headlessauthn_watcher.go:156
2023-08-02T15:01:40Z WARN [AUTH:1]    Restart watch on error: watcher is closed: <nil>. pid:8.1 resource-kind:lock services/watcher.go:246
2023-08-02T15:01:44Z WARN [KUBERNETE] Restart watch on error: watcher is closed: rpc error: code = Unknown desc = watcher closed. pid:8.1 resource-kind:lock services/watcher.go:246
2023-08-02T15:01:49Z INFO [PGBK]      Starting change feed stream. pgbk/background.go:100
2023-08-02T15:01:49Z INFO [PGBK]      Setting up change feed. slot_name:46d6e54ad7aa4e4f8d8f99e530df1b4e pgbk/background.go:162
@espadolini
Copy link
Contributor

Azure Database for PostgreSQL Single Server is capped at postgres v11 (and will be retired in march 2025), whereas the docs that we're about to merge (😬) will state that we require postgres 13 or later, mostly because of the performance improvements with regards to logical decoding. That being said, I would've still expected it to work on v11. I'll do some test and see what's going on.

That being said, I strongly recommend using a Flexible Server, which should support up to postgres v15 on all regions now.

@espadolini
Copy link
Contributor

espadolini commented Aug 2, 2023

@anthonysomerset do you have bytea_output set to escape by any chance?

I was able to reproduce this with bytea_output set to escape; if that's your situation, as a workaround, add bytea_output=hex to your connstring. A future fix on the Teleport side will be pretty much just that.

(for context https://github.com/eulerto/wal2json/blob/30d2ec11dfaa54980db4e2e1e3de8fe7ed71b6cb/wal2json.c#L1321-L1323, tip of master at time of writing)

@anthonysomerset
Copy link
Contributor Author

thanks i see its set to HEX server side too

looking at how to change the conn string in the helm config at moment - will advise

@espadolini
Copy link
Contributor

In the chart you have to override the whole string in auth.teleportConfig.teleport.storage.conn_string, unfortunately. Alternatively you could see if there's a parameter override for bytea_output for the user used by Teleport.

If it's not that problem, could you psql in the same database as the backend (teleport_backend), do the following, and report back what you see?

SELECT * FROM pg_create_logical_replication_slot('foobar', 'wal2json', true);

SELECT data FROM pg_logical_slot_get_changes('foobar', NULL, NULL, 'format-version', '2', 'add-tables', 'public.kv', 'include-transaction', 'false');

Repeat the last statement as necessary to get events, as it might return nothing if there's no changes. Disconnect the session to drop the temporary slot that was created.

@anthonysomerset
Copy link
Contributor Author

anthonysomerset commented Aug 2, 2023

Thanks

overriding the conn_string ( i also override the audit storage too just in case because same db... )

but still getting the errors

running those commands above - i seem to get a number of results - nothing looks out of ordinary they look like plain old json

I need to work on migrating this specific postgres instance to flexible anyway will also advise how that goes

@espadolini
Copy link
Contributor

Can I get an example json string?

What version of postgres is this?

@anthonysomerset
Copy link
Contributor Author

"{""action"":""U"",""schema"":""public"",""table"":""kv"",""columns"":[{""name"":""key"",""type"":""bytea"",""value"":""2f6461746162617365536572766572732f64656661756c742f32346662363838342d383962652d346137352d616536342d3065646563356439626235322f6c742d68712d7072642d6d7973716c30322d7a61""},{""name"":""expires"",""type"":""timestamp with time zone"",""value"":""2023-08-02 17:49:01.762002+00""},{""name"":""revision"",""type"":""uuid"",""value"":""57680326-6c23-4155-b3d6-f87f2e8ddf53""}],""identity"":[{""name"":""key"",""type"":""bytea"",""value"":""2f6461746162617365536572766572732f64656661756c742f32346662363838342d383962652d346137352d616536342d3065646563356439626235322f6c742d68712d7072642d6d7973716c30322d7a61""}]}"

Its v11 on azure single server right now

@espadolini
Copy link
Contributor

Just to confirm that the issue is what I think it is, could you SELECT length(value) FROM kv WHERE key = '\x 2f6461746162617365536572766572732f64656661756c742f32346662363838342d383962652d346137352d616536342d3065646563356439626235322f6c742d68712d7072642d6d7973716c30322d7a61' and report the value?

@anthonysomerset
Copy link
Contributor Author

length returns 5919

@anthonysomerset
Copy link
Contributor Author

FYI i migrated to postgres flexible (version 15.2)

i appear to get the same behaviour

@espadolini
Copy link
Contributor

The issue is that in keepalive operations we don't touch the value, but if the value column is TOASTed for that row, we can't read it in the change feed. #29965 fixes that (with a more long term fix that will probably rely on REPLICA IDENTITY FULL at a later time).

Sorry about the confusion, although migrating to flexible server was something you should've eventually done regardless I guess 😅

@espadolini
Copy link
Contributor

@anthonysomerset 13.3.1 was released with a fix for this, could you give it a try?

@anthonysomerset
Copy link
Contributor Author

already done and confirmed good now thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants