MUC rooms in error state #3881

logicwonder · 2022-08-05T10:01:16Z

Environment

ejabberd version: 22.05
Erlang version: 12.3.1
OS: Linux (Debian)
Installed from: source

Errors from error.log/crash.log

2022-08-05 13:46:15.144600+05:30 [info] <0.14919.7002>@mod_http_api:log/3:511 API call get_room_options [{<<"name">>,
                            <<"717e5145-e8be-46b0-8e42-9cc7974fb1f7">>},
                           {<<"service">>,<<"conference.xxx.xxx.xxx">>}] from ::ffff:127.0.0.1:61944
2022-08-05 13:46:15.145137+05:30 [error] <0.14919.7002>@mod_http_api:handle/4:259 REST API Error: get_room_options([{<<"name">>,
                                   <<"717e5145-e8be-46b0-8e42-9cc7974fb1f7">>},
                                  {<<"service">>,
                                   <<"conference.xxx.xxx.xxx">>}]) -> error:{badmatch,
                                                                                  {error,
                                                                                   notfound}} [{mod_muc_admin,
                                                                                                get_room_config,
                                                                                                1,
                                                                                                [{file,
                                                                                                  "src/mod_muc_admin.erl"},
                                                                                                 {line,
                                                                                                  926}]},
                                                                                               {mod_muc_admin,
                                                                                                get_room_options,
                                                                                                1,
                                                                                                [{file,
                                                                                                  "src/mod_muc_admin.erl"},
                                                                                                 {line,
                                                                                                  1237}]},
                                                                                               {mod_http_api,
                                                                                                handle2,
                                                                                                4,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  269}]},
                                                                                               {mod_http_api,
                                                                                                handle,
                                                                                                4,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  230}]},
                                                                                               {mod_http_api,
                                                                                                perform_call,
                                                                                                4,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  190}]},
                                                                                               {mod_http_api,
                                                                                                process,
                                                                                                2,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  143}]},
                                                                                               {ejabberd_http,
                                                                                                process,
                                                                                                2,
                                                                                                [{file,
                                                                                                  "src/ejabberd_http.erl"},
                                                                                                 {line,
                                                                                                  373}]},
                                                                                               {ejabberd_http,
                                                                                                process_request,
                                                                                                1,
                                                                                                [{file,
                                                                                                  "src/ejabberd_http.erl"},
                                                                                                 {line,
                                                                                                  496}]}]

Bug description

I am unable to get room options for an already existing room using get _room_options API. It gives error Code: 400 "Conference room does not exist" as response.
The room records are available in muc_room table in PostgreSQL DB.

Now when I try to delete the room using destroy_room API, I get 0 (success) response.
Then I tried creating the room again using create_room_with_opts API. And now I get the error "Room already exists"

It looks like the room is in an error state in Mnesia. I have this issue with 30 to 40 rooms. Recently I have enabled message subscription (MucSub) to rooms. How to debug/resolve the issue?

The text was updated successfully, but these errors were encountered:

licaon-kter · 2022-08-05T10:12:35Z

If you try with ejabberdctl ... it works?

prefiks · 2022-08-05T10:27:40Z

Do you have hibernate option set in mod_muc (and to some low value)? I think it may be that room is hibernated in time between when we locate where it lives, and when we send a query to it.

logicwonder · 2022-08-05T14:01:39Z

@licaon-kter I tried with ejabberdctl. Still stuck up in this loop

ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx xxxx.xxx.xxx
{error,"Room already exists"}
ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx xxxx.xxx.xxx
{error,"Room already exists"}

@prefiks How to check hibernation status of room? I tried the following:

ejabberdctl get_room_options 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
in function mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
in call from erpc:execute_call/4 (erpc.erl, line 392)

licaon-kter · 2022-08-05T14:13:21Z

@logicwonder can you get the exit code for the destroy command? echo $?

prefiks · 2022-08-05T14:20:55Z

Hm, looks like table used for keeping mapping from room name to handler processes have some bad entries. Is that cluster or single node instance? Do you have any entry in error.log that talk about mod_muc?

You should be able to clean those records with something like that:

lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).

in debug console (just replace host at end).

logicwonder · 2022-08-05T16:16:34Z

@licaon-kter The exit code was 0

@prefiks - This is a cluster with 3 nodes. I tried the script in the debug shell. After execution the issue is not fixed.

Erlang/OTP 24 [erts-12.3.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Eshell V12.3.1  (abort with ^G)

(ejabberd@node1.myserver)1> lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.xxxx.xxxx.xxx">>)).
ok
(ejabberd@node1.myserver)2> 
BREAK: (a)bort (A)bort with dump (c)ontinue (p)roc info (i)nfo
       (l)oaded (v)ersion (k)ill (D)b-tables (d)istribution
^C
ejabberd@node1:~$ ejabberdctl get_room_options 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function  mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392)
ejabberd@node1:~$ echo $?
1
ejabberd@node1:~$ ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx xxxxx.xxxx.xxx
{error,"Room already exists"}
ejabberd@node1:~$ ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx 
ejabberd@node1:~$ echo $?
0

Error log with mod_muc:

2022-08-04 19:43:19.668717+05:30 [error] <0.16826.6542>@ejabberd_sql:check_error/2:1236 SQL query 'Q106720625' at {mod_muc_sql,{105,9}} failed: <<"timed out">>
2022-08-04 19:43:33.855798+05:30 [error] <0.21662.6298>@ejabberd_sql:check_error/2:1236 SQL query 'Q113641781' at {mod_muc_sql,{120,23}} failed: <<"timed out">>
2022-08-04 21:14:34.091632+05:30 [error] <0.6030.6811>@ejabberd_sql:check_error/2:1236 SQL query 'Q5148888' at {mod_muc_sql,{166,23}} failed: <<"timed out">>

prefiks · 2022-08-05T16:25:14Z

This command do delete only room local to node where it was executed, if those rooms are supposed to live on other nodes, this will not fix it. It may be problem with synching that data between nodes (maybe there was problem with connection when that room was deleted on other node?). Could you see if calling any of get_room_options on different node works or returns different result?

logicwonder · 2022-08-06T05:05:24Z

The issue was fixed after executing the script in all nodes. get_room_options started working for these rooms in all nodes.
Thanks @prefiks and @licaon-kter for your instant replies and the script to fix the issue.

logicwonder · 2022-08-16T12:05:13Z

Do you have hibernate option set in mod_muc (and to some low value)? I think it may be that room is hibernated in time between when we locate where it lives, and when we send a query to it.

I am getting the following error for a room that got hibernated. I have enabled a hibernation_timeout of 24 hrs.

> ejabberdctl get_room_options f09b2d05eb73a582 conference.xxxxx.xxxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function  mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392)

I tried executing the command in three nodes in eJabberd cluster the same response was received.

Is this the behaviour of hibernated rooms. How can I bring the rooms out of this state?

prefiks · 2022-08-16T12:15:01Z

It should be unhibernated, probably there is entry in online rooms table for that room that is no longer valid (it should be removed when room hibernates), so we don't attempt to restore it thinking it's still alive. Probably command that i posted above will fix that, but it would be nice to know why those rooms entries do still appear in that table...

logicwonder · 2022-08-16T13:41:31Z

The script worked. I made a mistake in service name.

My understanding was that the room process for the hiberating room shall be brought up whenever there is administration/activity on the room. Is hibernation applicable for rooms where user are subscribed to messages (mucsub) were user does not join the room for sending.receiving messages? Can you explain how room hibernating work? I couldn't find much resources.

prefiks · 2022-08-16T13:48:08Z

So room will be unhibernated if someone sends message/iq/presence request or when any action is done on it from ejabberdctl (at least i think i updated all commands to do that). We don't hibernate rooms that have active participants (i.e. any where there was someone that sent <presence/> to it). We will however hibernate rooms that have only muc-sub subscribers - we will unhibernate them on any action. In general it should be transparent if room is hibernated or not (with one exception - sending disco items to muc service will not list hibernated rooms - only active rooms will be shown, ideally i would like to not have this, but we don't store info that would be required for that in easily accessible maner).

logicwonder · 2022-08-16T14:03:44Z

Thanks @prefiks for clarifying the concept. Could it be a bug with hibernated that the room is not waking up when I try subscribe API call or get_room_options command. I have only mucsub based room subscription without room joining, would that be related to the issue?

logicwonder · 2022-12-14T07:35:24Z

I started facing similar issue when trying to delete a room

ejabberd version: 22.05
Erlang version: 12.3.1
OS: Linux (Debian)
Installed from: source
3 node cluster

Using PostgreSQL as SQL DB

Steps done:

Deleted a room using the admin command ejabberdctl destroy room. There are no errors and the room record got deleted from the SQL table muc_room
Tried to create that room again with admin command create_room the following message is shown:

{error,"Room already exists"}

Tried running get_room_options for the room and got the following error:

Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function  mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392)

Also tried running the following snippet suggested by @prefiks in all my three nodes in cluster. But the room is still in some error state and not completely deleted. I am not finding anything in error.log

Hm, looks like table used for keeping mapping from room name to handler processes have some bad entries. Is that cluster or single node instance? Do you have any entry in error.log that talk about mod_muc?

You should be able to clean those records with something like that:
lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).
in debug console (just replace host at end).

Please help me in resolving the issue

prefiks · 2022-12-15T10:38:29Z

Could you try executing those commands for your room and see what they return?

mod_muc_admin:get_room_pid(<<"room">>, <<"conference.myserver.com">>).
rpc:pinfo(mod_muc_admin:get_room_pid(<<"room">>, <<"conference.myserver.com">>)).

logicwonder · 2022-12-16T05:43:50Z

I got error executing these commands

(ejabberd@xmpp1.local)7> mod_muc_admin:get_room_pid(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>).
** exception error: undefined function mod_muc_admin:get_room_pid/2

(ejabberd@xmpp1.local)6> rpc:pinfo(mod_muc_admin:get_room_pid(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>)).
** exception error: undefined function mod_muc_admin:get_room_pid/2

prefiks · 2022-12-16T08:33:11Z

Ah, right it's only available in 22.10, so let's try this:

(fun(R,S) ->H = ejabberd_router:host_of_route(S), case mod_muc:unhibernate_room(H,S,R) of error -> no_room; {ok, P} -> {P, rpc:pinfo(P)} end end)(<<"room">>, <<"conference.server">>).

logicwonder · 2022-12-17T10:37:17Z

Thanks for the response. I restarted one of the nodes and then I was able to delete the room. I shall try the script the next time when similar issue is faced.

Thanks for you support.

I got the following response for the above command:

{<0.17927.3300>,
[{current_function,{p1_fsm,collect_messages,3}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,0},
{links,[<0.11423.0>]},
{dictionary,[{'$ancestors',['mod_muc_room_sup_xxx.xxx.xx',
'mod_muc_sup_xxx.xxx.xx',ejabberd_gen_mod_sup,
ejabberd_sup,<0.11264.0>]},
{'$internal_queue_len',0},
{rand_seed,{#{bits => 58,jump => #Fun<rand.3.92093067>,
next => #Fun<rand.0.92093067>,type => exsss,
uniform => #Fun<rand.1.92093067>,
uniform_n => #Fun<rand.2.92093067>},
[256314510256950933|203838542858434267]}},
{'$initial_call',{mod_muc_room,init,1}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.11263.0>},
{total_heap_size,10971},
{heap_size,4185},
{stack_size,18},
{reductions,757200},
{garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,36}]},
{suspending,[]}]}

badlop · 2023-06-21T12:00:05Z

Hi, is there any task to do regarding this issue, or can it get closed?

relbraun · 2023-07-12T08:47:12Z

I have faced the same issue when tried to run mod_muc_admin:get_room_options. I have also 3 nodes on cluster and no hibernation option defined in the .yml file (as far as I understand, it means no hibernation in any case).

@prefiks can you please explain me what that script do?

lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).

Also I need to understand why this situation occurs and how can I catch it before the end users will face it in the client side?

prefiks · 2023-07-12T10:49:31Z

Most likelly, tables keeping info about active rooms got desynched between cluster nodes (probably there was problem with network between nodes when information about room changes happened, and there is stalled info about room somewhere). This script tries to sync rooms table state with what actual rooms are running.

relbraun · 2023-07-12T12:38:37Z

@prefiks thanks for the response. Is there any way I can be aware about such situation before the client experiences it?

And also, regarding the network problems you have mentioned, we use AWS EKS who (as we think) not likely to have such network problems. Do you know about such problems in EKS?

relbraun · 2023-11-29T15:27:19Z

Hi, I have this problem every 2 days in my 3 nodes cluster. Now there is a muc that returns error when I run mod_muc_admin:get_room_options on each node.
Any idea how to solve it? And moreover, how can I prevent this situation?
@prefiks please maybe you have any idea?

relbraun · 2023-11-29T18:51:54Z

And also, when I run this:
(fun(R,S) ->H = ejabberd_router:host_of_route(S), case mod_muc:unhibernate_room(H,S,R) of error -> no_room; {ok, P} -> {P, rpc:pinfo(P)} end end)(<<"roomid">>, <<"conference.myswerver.com">>).

I get this: {<31537.5036.1>,undefined}

prefiks · 2023-11-30T09:14:29Z

You could try running this on each node, it possibly could help:
{code}
lists:foreach(fun({R, H, P}) ->case node(P)/=node() andalso rpc:pinfo(P, status) == undefined of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).
{code}

relbraun · 2023-11-30T10:25:42Z

Thanks @prefiks

But what causes this situation and how can we avoid it or at least, intercept it immediately when it occurs?

logicwonder · 2023-11-30T10:29:01Z

@prefiks In my experience, the inconsistency between cluster nodes is bound to happen either because of network or other reasons. Could you please suggest a cursory check (an Erlang script) that can detect such inconsistencies in a cluster. This could be of great help and may be effective to support the issues raised in future.

prefiks · 2023-11-30T11:09:25Z

This is most likely result of network issues between node clusters, probably at some node got dropped from a cluster, when other nodes couldn't reach it, and it could happen that operations execute in that time aren't properly propagated when connection get restored. Hard to tell when desync like that happens, it probably would require comparing table content of all nodes but this gets tricky since those tables can change underneath. Maybe just checking if size of all tables match can work in most cases?
{code}
ejabberd_cluster:multicall(mnesia, table_info, [muc_online_room, size]).
{code}
This should return {[SizeA, SizeB, SizeC, ...], [NotReachableNodeA, ...]}, if Size* are different, then likelly tables are desynced.

logicwonder closed this as completed Aug 6, 2022

logicwonder reopened this Aug 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MUC rooms in error state #3881

MUC rooms in error state #3881

logicwonder commented Aug 5, 2022 •

edited by prefiks

licaon-kter commented Aug 5, 2022

prefiks commented Aug 5, 2022

logicwonder commented Aug 5, 2022

licaon-kter commented Aug 5, 2022

prefiks commented Aug 5, 2022

logicwonder commented Aug 5, 2022

prefiks commented Aug 5, 2022

logicwonder commented Aug 6, 2022

logicwonder commented Aug 16, 2022

prefiks commented Aug 16, 2022

logicwonder commented Aug 16, 2022

prefiks commented Aug 16, 2022 •

edited

logicwonder commented Aug 16, 2022

logicwonder commented Dec 14, 2022

prefiks commented Dec 15, 2022

logicwonder commented Dec 16, 2022

prefiks commented Dec 16, 2022

logicwonder commented Dec 17, 2022

badlop commented Jun 21, 2023

relbraun commented Jul 12, 2023

prefiks commented Jul 12, 2023

relbraun commented Jul 12, 2023 •

edited

relbraun commented Nov 29, 2023

relbraun commented Nov 29, 2023

prefiks commented Nov 30, 2023

relbraun commented Nov 30, 2023

logicwonder commented Nov 30, 2023

prefiks commented Nov 30, 2023

MUC rooms in error state #3881

MUC rooms in error state #3881

Comments

logicwonder commented Aug 5, 2022 • edited by prefiks

Environment

Errors from error.log/crash.log

Bug description

licaon-kter commented Aug 5, 2022

prefiks commented Aug 5, 2022

logicwonder commented Aug 5, 2022

licaon-kter commented Aug 5, 2022

prefiks commented Aug 5, 2022

logicwonder commented Aug 5, 2022

prefiks commented Aug 5, 2022

logicwonder commented Aug 6, 2022

logicwonder commented Aug 16, 2022

prefiks commented Aug 16, 2022

logicwonder commented Aug 16, 2022

prefiks commented Aug 16, 2022 • edited

logicwonder commented Aug 16, 2022

logicwonder commented Dec 14, 2022

prefiks commented Dec 15, 2022

logicwonder commented Dec 16, 2022

prefiks commented Dec 16, 2022

logicwonder commented Dec 17, 2022

badlop commented Jun 21, 2023

relbraun commented Jul 12, 2023

prefiks commented Jul 12, 2023

relbraun commented Jul 12, 2023 • edited

relbraun commented Nov 29, 2023

relbraun commented Nov 29, 2023

prefiks commented Nov 30, 2023

relbraun commented Nov 30, 2023

logicwonder commented Nov 30, 2023

prefiks commented Nov 30, 2023

logicwonder commented Aug 5, 2022 •

edited by prefiks

prefiks commented Aug 16, 2022 •

edited

relbraun commented Jul 12, 2023 •

edited