Once "context deadline exceeded" error happened, need to restart service. #21641

Closed
shawn111 opened this issue Nov 11, 2022 · 10 comments
Labels
Bug Report/Open Bug report/issue


@shawn111

shawn111 commented Nov 11, 2022

Summary

When this error occurs, the Mattermost service can no longer be accessed.
Users are redirected to the login page and shown "Token Error", as in https://forum.gitlab.com/t/mattermost-token-error/65279/2.

Checking mattermost.log shows that once this error occurs, all DB connections are broken:
Nov 11 09:41:13 messaging-server mattermost: {"timestamp":"2022-11-11 09:41:13.085 +08:00","level":"error","msg":"Failed to list plugin key values","caller":"app/plugin_key_value_store.go:167","page":0,"perPage":1000,"error":"failed to get PluginKeyValues with pluginId=playbooks: context deadline exceeded"}

I am not sure whether it is related to jackc/pgx#1313.


Steps to reproduce

Mattermost Version: 7.1.3
Database Schema Version: 89
Database: postgres (13.8)

A complete restart of the system solves the problem, but it comes back every 4-5 days.
We checked the DB, memory, and network status, but could not find the problem.

Expected behavior

Even when the error happens, only that query should fail, not every query after it.

Observed behavior (that appears unintentional)

Once "context deadline exceeded" error happened, need to restart service.

[root@messaging-server /]# grep "context deadline exceeded" /opt/mattermost/log/mattermost.log | nl
     1	Nov 11 09:40:51 messaging-server mattermost: {"timestamp":"2022-11-11 09:40:51.821 +08:00","level":"error","msg":"Unable to get the posts for the channel.","caller":"web/context.go:105","path":"/api/v4/channels/h9jjsc9cy3fauea87fif76jm9e/posts","request_id":"qpr5rhk1obbd5qj575soi7wrha","ip_addr":"61.227.52.141","user_id":"zcouasondpnxmjot3m7wg64tbw","method":"GET","err_where":"GetPostsSince","http_code":500,"err_details":"failed to find Posts with channelId=h9jjsc9cy3fauea87fif76jm9e: context deadline exceeded"}
     2	Nov 11 09:40:51 messaging-server mattermost: {"timestamp":"2022-11-11 09:40:51.915 +08:00","level":"error","msg":"We encountered an error while finding user profiles.","caller":"web/context.go:105","path":"/api/v4/users/ids","request_id":"6xqay1wbd3gj8djc1d3ptcznww","ip_addr":"61.227.52.141","user_id":"zcouasondpnxmjot3m7wg64tbw","method":"POST","err_where":"GetUsersByIds","http_code":500,"err_details":"failed to find Users: context deadline exceeded"}
     3	Nov 11 09:40:51 messaging-server mattermost: {"timestamp":"2022-11-11 09:40:51.915 +08:00","level":"error","msg":"Unable to get the posts for the channel.","caller":"web/context.go:105","path":"/api/v4/channels/9thezsgwmjfg8jn8p75oxxqz5r/posts","request_id":"bkq17jcc5iga3ru1yhi7m95m7o","ip_addr":"61.227.52.141","user_id":"zcouasondpnxmjot3m7wg64tbw","method":"GET","err_where":"GetPostsSince","http_code":500,"err_details":"failed to find Posts with channelId=9thezsgwmjfg8jn8p75oxxqz5r: context deadline exceeded"}
...
  2038	Nov 11 10:05:15 messaging-server mattermost: {"timestamp":"2022-11-11 10:05:15.853 +08:00","level":"error","msg":"Unable to get the teams unread messages.","caller":"web/context.go:105","path":"/api/v4/users/me/teams/unread","request_id":"89tmianebj865ekh3hikg9uz9o","ip_addr":"182.150.116.153","user_id":"tz3o1c49abr85rp6t4mykrwbiw","method":"GET","err_where":"GetTeamsUnreadForUser","http_code":500,"err_details":"failed to find Channels with userId=tz3o1c49abr85rp6t4mykrwbiw and teamId!=: context deadline exceeded"}
  2039	Nov 11 10:05:15 messaging-server mattermost: {"timestamp":"2022-11-11 10:05:15.897 +08:00","level":"error","msg":"We encountered an error while finding user profiles.","caller":"web/context.go:105","path":"/api/v4/users/ids","request_id":"wmiykmkqpjbjxm5kea6emctcwa","ip_addr":"182.150.116.153","user_id":"tz3o1c49abr85rp6t4mykrwbiw","method":"POST","err_where":"GetUsersByIds","http_code":500,"err_details":"failed to find Users: context deadline exceeded"}
  2040	Nov 11 10:05:16 messaging-server mattermost: {"timestamp":"2022-11-11 10:05:16.686 +08:00","level":"warn","msg":"Error while creating session for user access token","caller":"app/session.go:89","error":"createSessionForUserAccessToken: Invalid or missing token., failed to get UserAccessToken with token=tkkicpgdw7bg8ew8uamcgqdjpo: context deadline exceeded"}
  2041	Nov 11 10:05:17 messaging-server mattermost: {"timestamp":"2022-11-11 10:05:17.025 +08:00","level":"warn","msg":"Error while creating session for user access token","caller":"app/session.go:89","error":"createSessionForUserAccessToken: Invalid or missing token., failed to get UserAccessToken with token=tkkicpgdw7bg8ew8uamcgqdjpo: context deadline exceeded"}
  2042	Nov 11 10:05:17 messaging-server mattermost: {"timestamp":"2022-11-11 10:05:17.106 +08:00","level":"warn","msg":"Error while creating session for user access token","caller":"app/session.go:89","error":"createSessionForUserAccessToken: Invalid or missing token., failed to get UserAccessToken with token=tkkicpgdw7bg8ew8uamcgqdjpo: context deadline exceeded"}
  2043	Nov 11 10:05:17 messaging-server mattermost: {"timestamp":"2022-11-11 10:05:17.537 +08:00","level":"warn","msg":"Error while creating session for user access token","caller":"app/session.go:89","error":"createSessionForUserAccessToken: Invalid or missing token., failed to get UserAccessToken with token=tkkicpgdw7bg8ew8uamcgqdjpo: context deadline exceeded"}
  2044	Nov 11 10:05:18 messaging-server mattermost: {"timestamp":"2022-11-11 10:05:18.724 +08:00","level":"error","msg":"Unable to save the token.","caller":"web/context.go:105","path":"/oauth/gitlab/login","request_id":"ece6t76n5jfcdmfkn5wteayuhe","ip_addr":"221.237.25.97","user_id":"","method":"GET","err_where":"CreateOAuthStateToken","http_code":500,"err_details":"failed to save Token: context deadline exceeded"}
  2045	Nov 11 10:05:22 messaging-server mattermost: {"timestamp":"2022-11-11 10:05:22.237 +08:00","level":"error","msg":"Unable to update the last viewed at time.","caller":"web/context.go:105","path":"/api/v4/channels/members/me/view","request_id":"hzp7mf9fa3n89x3moz6pr3kfjw","ip_addr":"218.94.118.90","user_id":"pdjcf9h3ujymmn9qygc4m967so","method":"POST","err_where":"MarkChannelsAsViewed","http_code":500,"err_details":"failed to mark all threads as read by channels for user id=pdjcf9h3ujymmn9qygc4m967so: context deadline exceeded"}
  2046	Nov 11 10:05:22 messaging-server mattermost: {"timestamp":"2022-11-11 10:05:22.275 +08:00","level":"error","msg":"Unable to update the last viewed at time.","caller":"web/context.go:105","path":"/api/v4/channels/members/me/view","request_id":"yqjn9dq5kfga3jgdxi7h1uq6cy","ip_addr":"218.94.118.90","user_id":"pdjcf9h3ujymmn9qygc4m967so","method":"POST","err_where":"MarkChannelsAsViewed","http_code":500,"err_details":"failed to mark all threads as read by channels for user id=pdjcf9h3ujymmn9qygc4m967so: context deadline exceeded"}
  2047	Nov 11 10:05:22 messaging-server mattermost: {"timestamp":"2022-11-11 10:05:22.297 +08:00","level":"warn","msg":"Failed to save status","caller":"app/status.go:332","user_id":"uyb7inmwdtf18f799c7my1or1w","error":"failed to upsert Status: context deadline exceeded"}

We restarted Mattermost at 10:05; since then no such error has occurred.

Possible fixes

Should we upgrade pgx to fix it?
jackc/pgx#1313

@amyblais
Member

@amyblais amyblais added the Bug Report/Open Bug report/issue label Nov 11, 2022
@agnivade
Member

Hello @shawn111 - an error of "context deadline exceeded" means that the database exceeded the query timeout set by the application.
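
For illustration, here is a minimal sketch of how such a per-query deadline surfaces as this error, assuming the standard database/sql API with the pgx stdlib driver (the DSN, table name, and 10-second timeout are made-up values for the example, not Mattermost's actual code):

```go
// Minimal illustration (not Mattermost's actual code) of how a per-query
// timeout turns a slow database or network into "context deadline exceeded".
// The DSN, table name, and timeout value are assumptions for the example.
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib" // pgx used through the standard database/sql interface
)

func main() {
	db, err := sql.Open("pgx", "postgres://mmuser:secret@db-host:5432/mattermost")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// The application attaches a deadline to every query; if the database
	// (or the network in between) does not answer within it, the driver
	// returns context.DeadlineExceeded.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	var count int
	err = db.QueryRowContext(ctx, "SELECT count(*) FROM posts").Scan(&count)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("query timed out: context deadline exceeded")
	}
}
```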

I'd suggest checking the database logs for any issues. But as you mentioned above, the DB does not appear to have any issues.

I think I'd need some more information about your setup to be able to provide further info.

  • Is it a docker setup?
  • Are both the application and DB in the same host?
  • What are your system specs and what is your DB size?

Let me tag @anx-ag who's had some experience debugging these issues before. Alexander, do you know what could be the issue here?

@anx-ag
Contributor

anx-ag commented Nov 14, 2022

If it happens irregularly, or only when the system has been running for some days, there might be an autovacuum blocking things or reducing performance so that other queries cannot finish. It's also possible that the application's connection pools fill up because of that, or that you reach max_connections on your database server, and the application therefore stops working for new queries while the older ones are timing out.
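
For reference, the application-side pool limits involved look roughly like this; a minimal database/sql sketch where the numbers are illustrative assumptions rather than Mattermost's defaults (in Mattermost the equivalents come from the SqlSettings block in config.json):

```go
// Sketch of the connection-pool limits that can be exhausted when queries
// run long. The values are made up for illustration only.
package main

import (
	"database/sql"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib"
)

func openPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(300)                 // cap on concurrent connections to Postgres
	db.SetMaxIdleConns(20)                  // connections kept warm between queries
	db.SetConnMaxLifetime(60 * time.Minute) // recycle connections so stale ones are dropped
	return db, nil
}

func main() {
	// Hypothetical DSN for the example.
	db, err := openPool("postgres://mmuser:secret@db-host:5432/mattermost")
	if err != nil {
		panic(err)
	}
	defer db.Close()
}
```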

In addition to what @agnivade requested, can you please also share the SqlSettings block of your config.json file (with the passwords and other sensitive details removed, of course)?
When you say "We check DB": what exactly has been checked? Did you also look at the running processes in the database when the problems are ongoing?

The output of the following query while you're experiencing the problem would be very helpful:

select * from pg_stat_activity;

@shawn111
Author

shawn111 commented Nov 14, 2022

@agnivade yes, at the beginning I also thought this issue was caused by DB load.
However, after checking the Grafana PostgreSQL dashboard, I now think the problem might come from the Mattermost side.

{"timestamp":"2022-11-13 09:14:44.587 +08:00","level":"error","msg":"GetChannelMembersForUserWithPagination: Unable to get the channel members., failed to find ChannelMembers data with and userId=1gigwsncqibwdp5euyp64xy3fw: context deadline exceeded","caller":"app/plugin_api.go:940","plugin_id":"com.mattermost.calls","origin":"main.(*Plugin).handleGetAllChannels api.go:129"}

In order to fix it, we restarted the Mattermost service around 9:16.
[screenshot: Grafana PostgreSQL dashboard around the restart]

@anx-ag, here is the info from the last 4 days.
[screenshots: PostgreSQL metrics for the last 4 days]

About my environment: Mattermost runs directly on the host server, but PostgreSQL has already been moved into a k8s environment.
They are not on the same internal LAN.

That might be the root cause of the "context deadline exceeded" errors, but even when this issue happens, it should recover after a moment.

@shawn111
Author

shawn111 commented Nov 14, 2022

[screenshot: database query activity showing a spike every hour]

BTW, I'm curious that there seems to be a big query every hour, even in the middle of the night.
We should not have many users in the middle of the night.


@shawn111
Author

pg_dump: error: query failed: ERROR:  permission denied for schema zhparser
pg_dump: error: query was: LOCK TABLE zhparser.zhprs_custom_word IN ACCESS SHARE MODE

@agnivade It seems this might be caused by my setup.
We use the zhparser extension; maybe the "context deadline exceeded" is related to the LOCK.

@anx-ag
Contributor

anx-ag commented Nov 26, 2022

Is the zhparser extension enabled in the Mattermost database? I must admit I was unable to follow the documentation on the GitHub page properly, but the hourly peaks are most likely not caused by this extension. If you want to find out what queries these are, you could enable query logging a few minutes before this happens and disable it again afterwards, so we can find out what causes these spikes; they're most likely unrelated to your problem, though.

Also, did you find time to run the query I sent you, and if so, what was the result? Please also send the requested configuration settings from your config.json.
Are there any other statistics or metrics from your database server? The screenshots you shared here are not really helpful when debugging issues like these; other metrics such as query duration, system load, number of queries per second, etc. would be more interesting.

@shawn111
Author

shawn111 commented Nov 29, 2022

@anx-ag It might be a network issue between my PostgreSQL server and the Mattermost server.

Ping time is not stable.

2022-11-28 18:19:57 64 bytes from 172.16.0.65: icmp_seq=11491 ttl=62 time=0.390 ms
2022-11-28 18:19:58 64 bytes from 172.16.0.65: icmp_seq=11492 ttl=62 time=0.448 ms
2022-11-28 18:19:59 64 bytes from 172.16.0.65: icmp_seq=11493 ttl=62 time=0.481 ms
2022-11-28 18:20:01 64 bytes from 172.16.0.65: icmp_seq=11494 ttl=62 time=12.6 ms
2022-11-28 18:20:01 64 bytes from 172.16.0.65: icmp_seq=11495 ttl=62 time=0.454 ms
2022-11-28 18:20:02 64 bytes from 172.16.0.65: icmp_seq=11496 ttl=62 time=0.373 ms
2022-11-28 18:20:03 64 bytes from 172.16.0.65: icmp_seq=11497 ttl=62 time=0.370 ms

Database load is still far from reaching its limit.
The unstable network is probably why the "context deadline exceeded" error happens.

However, it points out that once "context deadline exceeded" happened, all DB connections were broken.
I'm not sure whether it is related to jackc/pgx#1313.

@anx-ag
Contributor

anx-ag commented Nov 29, 2022

Getting "context deadline exceeded" messages is just the result, not the cause. This usually gets printed when queries take longer than allowed, and long-running queries tie up your available database connections, which can then result in a situation where no free slots are available.
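
A minimal sketch of that failure mode, assuming database/sql with the pgx stdlib driver (the DSN and pool size are made up): once slow queries hold every pooled connection, even a trivial query spends its whole deadline waiting for a free slot and reports the same error, which is why it looks like every query breaks at once.

```go
// Sketch of pool exhaustion: a deliberately slow query holds the only
// connection, so a second, normally instant query times out while waiting
// for a free slot. DSN and pool size are illustrative assumptions.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib"
)

func main() {
	db, err := sql.Open("pgx", "postgres://mmuser:secret@db-host:5432/mattermost")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	db.SetMaxOpenConns(1) // tiny pool to make the effect obvious

	// Occupy the only connection with a long-running statement.
	go func() {
		_, _ = db.Exec("SELECT pg_sleep(30)")
	}()
	time.Sleep(time.Second) // give the slow query time to grab the connection

	// This trivial query never reaches the database: it waits for a pooled
	// connection until its deadline expires and fails with the same error.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	var one int
	err = db.QueryRowContext(ctx, "SELECT 1").Scan(&one)
	fmt.Println(err) // context deadline exceeded
}
```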

In order to debug this further I would really need the requested outputs and answers to my questions. Do you need help running those queries, or should I rephrase the questions if you have trouble understanding them? Just let me know.

@shawn111
Author

shawn111 commented Nov 29, 2022

Got it. I think it's quite clear that a network issue in my environment is causing this problem.
Thanks for your help.
We need to fix the network issue first.

I think we can close this issue?
