MUC rooms in error state #3881
Comments
If you try with ejabberdctl?
Do you have the hibernate option set in mod_muc (and to some low value)? I think the room may be hibernated in the time between when we locate where it lives and when we send a query to it.
@licaon-kter I tried with ejabberdctl. Still stuck in this loop.
@prefiks How do I check the hibernation status of a room? I tried the following: …
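One rough way to check from the `ejabberdctl debug` console, sketched under the assumption of the stock mod_muc API (room and service names here are placeholders): a hibernated room should have no live process registered for it.

```erlang
%% Sketch: does this room currently have a live process?
%% A hibernated room should show up as not online.
case mod_muc:find_online_room(<<"myroom">>, <<"conference.example.com">>) of
    {ok, Pid} when node(Pid) =:= node() ->
        {online, is_process_alive(Pid)};   %% false here means a stale entry
    {ok, Pid} ->
        {registered_on, node(Pid)};        %% process lives on another node
    error ->
        not_online                         %% hibernated or never started
end.
```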
@logicwonder can you get the exit code for the destroy command?
Hm, looks like the table that keeps the mapping from room names to handler processes has some bad entries. Is this a cluster or a single-node instance? Do you have any entries in error.log that mention mod_muc? You should be able to clean those records in the debug console with something like the sketch below (just replace the hosts at the end).
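A hedged sketch of such a cleanup, assuming the stock mod_muc API (both host names are placeholders): it walks the online-rooms table and unregisters entries whose local room process is dead.

```erlang
%% Sketch: drop online-room entries whose room process is dead.
%% Run in `ejabberdctl debug`; replace both hosts.
Service = <<"conference.example.com">>,
ServerHost = <<"example.com">>,
[mod_muc:unregister_online_room(ServerHost, Name, Host, Pid)
 || {Name, Host, Pid} <- mod_muc:get_online_rooms(Service),
    node(Pid) =:= node(),            %% is_process_alive/1 only works for local pids
    not is_process_alive(Pid)].
```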
@licaon-kter The exit code was 0. @prefiks This is a cluster with 3 nodes. I tried the script in the debug shell; after execution the issue is not fixed.
Error log entry mentioning mod_muc:

```
2022-08-04 19:43:19.668717+05:30 [error] <0.16826.6542>@ejabberd_sql:check_error/2:1236 SQL query 'Q106720625' at {mod_muc_sql,{105,9}} failed: <<"timed out">>
```
This command deletes only rooms local to the node where it was executed; if those rooms are supposed to live on other nodes, it will not fix them. It may be a problem with syncing that data between nodes (maybe there was a connection problem when that room was deleted on another node?). Could you check whether calling get_room_options on a different node works or returns a different result?
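For instance, a quick sketch from the debug console that asks every cluster node for the same room's options so the answers can be compared (room and service names are hypothetical):

```erlang
%% Query get_room_options on every node in the cluster and
%% pair each result with the node that produced it.
Room = <<"myroom">>,
Service = <<"conference.example.com">>,
[{N, rpc:call(N, mod_muc_admin, get_room_options, [Room, Service])}
 || N <- [node() | nodes()]].
```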
The issue was fixed after executing the script on all nodes. get_room_options started working for these rooms on all nodes.
I am getting the following error for a room that got hibernated: … I have enabled a hibernation_timeout of 24 hrs.
I tried executing the command on all three nodes of the ejabberd cluster and received the same response. Is this the behaviour of hibernated rooms? How can I bring the rooms out of this state?
It should be unhibernated. Probably there is an entry in the online rooms table for that room that is no longer valid (it should be removed when the room hibernates), so we don't attempt to restore it, thinking it's still alive. The command I posted above will probably fix that, but it would be nice to know why those room entries still appear in that table...
The script worked; I had made a mistake in the service name earlier. My understanding was that the room process for a hibernating room would be brought up whenever there is administration/activity on the room. Does hibernation apply to rooms where users are subscribed to messages (MucSub) but do not join the room to send/receive messages? Can you explain how room hibernation works? I couldn't find many resources.
So a room will be unhibernated if someone sends a message/iq/presence request to it, or when any action is done on it from ejabberdctl (at least I think I updated all commands to do that). We don't hibernate rooms that have active participants (i.e. any room where there was someone that sent …)
Thanks @prefiks for clarifying the concept. Could it be a bug with hibernation that the room is not waking up when I try the subscribe API call or the get_room_options command? I only have MucSub-based room subscriptions without room joining; would that be related to the issue?
I started facing a similar issue when trying to delete a room.
ejabberd version: 22.05
Database: PostgreSQL
Steps done: …
I also tried running the cleanup snippet suggested by @prefiks above on all three nodes in my cluster, but the room is still in some error state and not completely deleted. I am not finding anything in error.log.
Please help me resolve this issue.
Could you try executing those commands for your room and see what they return?
I got an error executing these commands:

```
(ejabberd@xmpp1.local)7> mod_muc_admin:get_room_pid(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>).
(ejabberd@xmpp1.local)6> rpc:pinfo(mod_muc_admin:get_room_pid(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>)).
```
Ah, right, it's only available in 22.10, so let's try this:
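A sketch of an equivalent that should work on 22.05: mod_muc:find_online_room/2 exists there (unlike mod_muc_admin:get_room_pid/2) and returns the room's pid, which can then be inspected. Room and service names are placeholders.

```erlang
%% Look up the room's pid without mod_muc_admin:get_room_pid/2,
%% then dump basic process info for it.
case mod_muc:find_online_room(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>) of
    {ok, Pid} -> rpc:pinfo(Pid);
    error -> room_not_online
end.
```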
Thanks for the response. I restarted one of the nodes and then I was able to delete the room. I shall try the script next time a similar issue comes up. Thanks for your support. I got the following response for the above command:

```
{<0.17927.3300>, …
```
Hi, is there any task left to do regarding this issue, or can it be closed?
I have faced the same issue when I tried to run … @prefiks, can you please explain what that script does?
Also, I need to understand why this situation occurs and how I can catch it before end users face it on the client side.
Most likely, the tables keeping info about active rooms got desynced between cluster nodes (probably there was a network problem between the nodes when the room changes happened, and stale info about a room is left somewhere). The script tries to sync the rooms table state with the rooms that are actually running.
@prefiks thanks for the response. Is there any way I can become aware of such a situation before clients experience it? Also, regarding the network problems you mentioned: we use AWS EKS, which (we think) is not likely to have such network problems. Do you know of such problems in EKS?
Hi, I have this problem every 2 days in my 3-node cluster. Now there is a MUC that returns an error when I run …
And also, when I run this: … I get this: …
You could try running the cleanup script from earlier in this thread on each node; it possibly could help.
Thanks @prefiks. But what causes this situation, and how can we avoid it, or at least intercept it immediately when it occurs?
@prefiks In my experience, inconsistency between cluster nodes is bound to happen, whether because of the network or other reasons. Could you please suggest a cursory check (an Erlang script) that can detect such inconsistencies in a cluster? That could be of great help and may make it easier to support issues raised in the future.
This is most likely the result of network issues between cluster nodes: probably at some point a node got dropped from the cluster when the other nodes couldn't reach it, and operations executed during that window weren't properly propagated when the connection was restored. It's hard to tell when a desync like that happens; detecting it would probably require comparing table contents across all nodes, which gets tricky since those tables can change underneath. Maybe just checking whether the table sizes match on all nodes would work in most cases?
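A cursory check along those lines, as a minimal sketch: compare the size of the online-rooms table on every node. This assumes the default mnesia RAM backend, where online rooms live in the muc_online_room table; equal sizes don't prove consistency, but a mismatch is a strong desync signal.

```erlang
%% Compare muc_online_room table sizes across all cluster nodes.
%% Run from the debug console on any one node.
Nodes = [node() | nodes()],
Sizes = [{N, rpc:call(N, mnesia, table_info, [muc_online_room, size])}
         || N <- Nodes],
case lists:usort([S || {_, S} <- Sizes]) of
    [_] -> {ok, Sizes};          %% all nodes report the same size
    _   -> {mismatch, Sizes}     %% likely desynced; investigate further
end.
```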
Environment
Errors from error.log/crash.log
Bug description
I am unable to get room options for an already existing room using the get_room_options API. It returns error code 400, "Conference room does not exist".
The room records are present in the muc_room table in the PostgreSQL DB.
Now, when I try to delete the room using the destroy_room API, I get a 0 (success) response.
Then I tried creating the room again using the create_room_with_opts API, and now I get the error "Room already exists".
It looks like the room is in an error state in Mnesia. I have this issue with 30 to 40 rooms. I recently enabled message subscriptions (MucSub) for rooms. How can I debug/resolve the issue?
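To make the contradictory sequence concrete, here is a sketch of the same three calls from the ejabberd debug console via mod_muc_admin. Room, service, and host names are hypothetical, and the comments restate the results reported above (via the HTTP API) rather than guaranteed console return values.

```erlang
%% The three calls from the bug report, debug-console style.
Room    = <<"myroom">>,
Service = <<"conference.example.com">>,
Host    = <<"example.com">>,
mod_muc_admin:get_room_options(Room, Service).
%% per the report: fails with "Conference room does not exist" (HTTP 400)
mod_muc_admin:destroy_room(Room, Service).
%% per the report: "succeeds" (exit code 0)
mod_muc_admin:create_room_with_opts(Room, Service, Host, []).
%% per the report: fails with "Room already exists"
```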