Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MUC rooms in error state #3881

Open
logicwonder opened this issue Aug 5, 2022 · 28 comments
Open

MUC rooms in error state #3881

logicwonder opened this issue Aug 5, 2022 · 28 comments

Comments

@logicwonder
Copy link

logicwonder commented Aug 5, 2022

Environment

  • ejabberd version: 22.05
  • Erlang version: 12.3.1
  • OS: Linux (Debian)
  • Installed from: source

Errors from error.log/crash.log

2022-08-05 13:46:15.144600+05:30 [info] <0.14919.7002>@mod_http_api:log/3:511 API call get_room_options [{<<"name">>,
                            <<"717e5145-e8be-46b0-8e42-9cc7974fb1f7">>},
                           {<<"service">>,<<"conference.xxx.xxx.xxx">>}] from ::ffff:127.0.0.1:61944
2022-08-05 13:46:15.145137+05:30 [error] <0.14919.7002>@mod_http_api:handle/4:259 REST API Error: get_room_options([{<<"name">>,
                                   <<"717e5145-e8be-46b0-8e42-9cc7974fb1f7">>},
                                  {<<"service">>,
                                   <<"conference.xxx.xxx.xxx">>}]) -> error:{badmatch,
                                                                                  {error,
                                                                                   notfound}} [{mod_muc_admin,
                                                                                                get_room_config,
                                                                                                1,
                                                                                                [{file,
                                                                                                  "src/mod_muc_admin.erl"},
                                                                                                 {line,
                                                                                                  926}]},
                                                                                               {mod_muc_admin,
                                                                                                get_room_options,
                                                                                                1,
                                                                                                [{file,
                                                                                                  "src/mod_muc_admin.erl"},
                                                                                                 {line,
                                                                                                  1237}]},
                                                                                               {mod_http_api,
                                                                                                handle2,
                                                                                                4,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  269}]},
                                                                                               {mod_http_api,
                                                                                                handle,
                                                                                                4,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  230}]},
                                                                                               {mod_http_api,
                                                                                                perform_call,
                                                                                                4,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  190}]},
                                                                                               {mod_http_api,
                                                                                                process,
                                                                                                2,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  143}]},
                                                                                               {ejabberd_http,
                                                                                                process,
                                                                                                2,
                                                                                                [{file,
                                                                                                  "src/ejabberd_http.erl"},
                                                                                                 {line,
                                                                                                  373}]},
                                                                                               {ejabberd_http,
                                                                                                process_request,
                                                                                                1,
                                                                                                [{file,
                                                                                                  "src/ejabberd_http.erl"},
                                                                                                 {line,
                                                                                                  496}]}]

Bug description

I am unable to get room options for an already existing room using get _room_options API. It gives error Code: 400 "Conference room does not exist" as response.
The room records are available in muc_room table in PostgreSQL DB.

Now when I try to delete the room using destroy_room API, I get 0 (success) response.
Then I tried creating the room again using create_room_with_opts API. And now I get the error "Room already exists"

It looks like the room is in an error state in Mnesia. I have this issue with 30 to 40 rooms. Recently I have enabled message subscription (MucSub) to rooms. How to debug/resolve the issue?

@licaon-kter
Copy link
Contributor

If you try with ejabberdctl ... it works?

@prefiks
Copy link
Member

prefiks commented Aug 5, 2022

Do you have hibernate option set in mod_muc (and to some low value)? I think it may be that room is hibernated in time between when we locate where it lives, and when we send a query to it.

@logicwonder
Copy link
Author

@licaon-kter I tried with ejabberdctl. Still stuck up in this loop

ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx xxxx.xxx.xxx
{error,"Room already exists"}
ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx xxxx.xxx.xxx
{error,"Room already exists"}

@prefiks How to check hibernation status of room? I tried the following:

ejabberdctl get_room_options 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
in function mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
in call from erpc:execute_call/4 (erpc.erl, line 392)

@licaon-kter
Copy link
Contributor

@logicwonder can you get the exit code for the destroy command? echo $?

@prefiks
Copy link
Member

prefiks commented Aug 5, 2022

Hm, looks like table used for keeping mapping from room name to handler processes have some bad entries. Is that cluster or single node instance? Do you have any entry in error.log that talk about mod_muc?

You should be able to clean those records with something like that:

lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).

in debug console (just replace host at end).

@logicwonder
Copy link
Author

@licaon-kter The exit code was 0

@prefiks - This is a cluster with 3 nodes. I tried the script in the debug shell. After execution the issue is not fixed.

Erlang/OTP 24 [erts-12.3.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Eshell V12.3.1  (abort with ^G)

(ejabberd@node1.myserver)1> lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.xxxx.xxxx.xxx">>)).
ok
(ejabberd@node1.myserver)2> 
BREAK: (a)bort (A)bort with dump (c)ontinue (p)roc info (i)nfo
       (l)oaded (v)ersion (k)ill (D)b-tables (d)istribution
^C
ejabberd@node1:~$ ejabberdctl get_room_options 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function  mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392)
ejabberd@node1:~$ echo $?
1
ejabberd@node1:~$ ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx xxxxx.xxxx.xxx
{error,"Room already exists"}
ejabberd@node1:~$ ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx 
ejabberd@node1:~$ echo $?
0

Error log with mod_muc:

2022-08-04 19:43:19.668717+05:30 [error] <0.16826.6542>@ejabberd_sql:check_error/2:1236 SQL query 'Q106720625' at {mod_muc_sql,{105,9}} failed: <<"timed out">>
2022-08-04 19:43:33.855798+05:30 [error] <0.21662.6298>@ejabberd_sql:check_error/2:1236 SQL query 'Q113641781' at {mod_muc_sql,{120,23}} failed: <<"timed out">>
2022-08-04 21:14:34.091632+05:30 [error] <0.6030.6811>@ejabberd_sql:check_error/2:1236 SQL query 'Q5148888' at {mod_muc_sql,{166,23}} failed: <<"timed out">>

@prefiks
Copy link
Member

prefiks commented Aug 5, 2022

This command do delete only room local to node where it was executed, if those rooms are supposed to live on other nodes, this will not fix it. It may be problem with synching that data between nodes (maybe there was problem with connection when that room was deleted on other node?). Could you see if calling any of get_room_options on different node works or returns different result?

@logicwonder
Copy link
Author

The issue was fixed after executing the script in all nodes. get_room_options started working for these rooms in all nodes.
Thanks @prefiks and @licaon-kter for your instant replies and the script to fix the issue.

@logicwonder
Copy link
Author

Do you have hibernate option set in mod_muc (and to some low value)? I think it may be that room is hibernated in time between when we locate where it lives, and when we send a query to it.

I am getting the following error for a room that got hibernated. I have enabled a hibernation_timeout of 24 hrs.

> ejabberdctl get_room_options f09b2d05eb73a582 conference.xxxxx.xxxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function  mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392)

I tried executing the command in three nodes in eJabberd cluster the same response was received.

Is this the behaviour of hibernated rooms. How can I bring the rooms out of this state?

@logicwonder logicwonder reopened this Aug 16, 2022
@prefiks
Copy link
Member

prefiks commented Aug 16, 2022

It should be unhibernated, probably there is entry in online rooms table for that room that is no longer valid (it should be removed when room hibernates), so we don't attempt to restore it thinking it's still alive. Probably command that i posted above will fix that, but it would be nice to know why those rooms entries do still appear in that table...

@logicwonder
Copy link
Author

The script worked. I made a mistake in service name.

My understanding was that the room process for the hiberating room shall be brought up whenever there is administration/activity on the room. Is hibernation applicable for rooms where user are subscribed to messages (mucsub) were user does not join the room for sending.receiving messages? Can you explain how room hibernating work? I couldn't find much resources.

@prefiks
Copy link
Member

prefiks commented Aug 16, 2022

So room will be unhibernated if someone sends message/iq/presence request or when any action is done on it from ejabberdctl (at least i think i updated all commands to do that). We don't hibernate rooms that have active participants (i.e. any where there was someone that sent <presence/> to it). We will however hibernate rooms that have only muc-sub subscribers - we will unhibernate them on any action. In general it should be transparent if room is hibernated or not (with one exception - sending disco items to muc service will not list hibernated rooms - only active rooms will be shown, ideally i would like to not have this, but we don't store info that would be required for that in easily accessible maner).

@logicwonder
Copy link
Author

Thanks @prefiks for clarifying the concept. Could it be a bug with hibernated that the room is not waking up when I try subscribe API call or get_room_options command. I have only mucsub based room subscription without room joining, would that be related to the issue?

@logicwonder
Copy link
Author

I started facing similar issue when trying to delete a room

ejabberd version: 22.05
Erlang version: 12.3.1
OS: Linux (Debian)
Installed from: source
3 node cluster

Using PostgreSQL as SQL DB

Steps done:

  1. Deleted a room using the admin command ejabberdctl destroy room. There are no errors and the room record got deleted from the SQL table muc_room
  2. Tried to create that room again with admin command create_room the following message is shown:
{error,"Room already exists"}
  1. Tried running get_room_options for the room and got the following error:
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function  mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392) 

Also tried running the following snippet suggested by @prefiks in all my three nodes in cluster. But the room is still in some error state and not completely deleted. I am not finding anything in error.log

Hm, looks like table used for keeping mapping from room name to handler processes have some bad entries. Is that cluster or single node instance? Do you have any entry in error.log that talk about mod_muc?

You should be able to clean those records with something like that:

lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).

in debug console (just replace host at end).

Please help me in resolving the issue

@prefiks
Copy link
Member

prefiks commented Dec 15, 2022

Could you try executing those commands for your room and see what they return?

mod_muc_admin:get_room_pid(<<"room">>, <<"conference.myserver.com">>).
rpc:pinfo(mod_muc_admin:get_room_pid(<<"room">>, <<"conference.myserver.com">>)).

@logicwonder
Copy link
Author

I got error executing these commands

(ejabberd@xmpp1.local)7> mod_muc_admin:get_room_pid(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>).
** exception error: undefined function mod_muc_admin:get_room_pid/2

(ejabberd@xmpp1.local)6> rpc:pinfo(mod_muc_admin:get_room_pid(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>)).
** exception error: undefined function mod_muc_admin:get_room_pid/2

@prefiks
Copy link
Member

prefiks commented Dec 16, 2022

Ah, right it's only available in 22.10, so let's try this:

(fun(R,S) ->H = ejabberd_router:host_of_route(S), case mod_muc:unhibernate_room(H,S,R) of error -> no_room; {ok, P} -> {P, rpc:pinfo(P)} end end)(<<"room">>, <<"conference.server">>).

@logicwonder
Copy link
Author

Thanks for the response. I restarted one of the nodes and then I was able to delete the room. I shall try the script the next time when similar issue is faced.

Thanks for you support.

I got the following response for the above command:

{<0.17927.3300>,
[{current_function,{p1_fsm,collect_messages,3}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,0},
{links,[<0.11423.0>]},
{dictionary,[{'$ancestors',['mod_muc_room_sup_xxx.xxx.xx',
'mod_muc_sup_xxx.xxx.xx',ejabberd_gen_mod_sup,
ejabberd_sup,<0.11264.0>]},
{'$internal_queue_len',0},
{rand_seed,{#{bits => 58,jump => #Fun<rand.3.92093067>,
next => #Fun<rand.0.92093067>,type => exsss,
uniform => #Fun<rand.1.92093067>,
uniform_n => #Fun<rand.2.92093067>},
[256314510256950933|203838542858434267]}},
{'$initial_call',{mod_muc_room,init,1}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.11263.0>},
{total_heap_size,10971},
{heap_size,4185},
{stack_size,18},
{reductions,757200},
{garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,36}]},
{suspending,[]}]}

@badlop
Copy link
Member

badlop commented Jun 21, 2023

Hi, is there any task to do regarding this issue, or can it get closed?

@relbraun
Copy link

I have faced the same issue when tried to run mod_muc_admin:get_room_options. I have also 3 nodes on cluster and no hibernation option defined in the .yml file (as far as I understand, it means no hibernation in any case).

@prefiks can you please explain me what that script do?

lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).

Also I need to understand why this situation occurs and how can I catch it before the end users will face it in the client side?

@prefiks
Copy link
Member

prefiks commented Jul 12, 2023

Most likelly, tables keeping info about active rooms got desynched between cluster nodes (probably there was problem with network between nodes when information about room changes happened, and there is stalled info about room somewhere). This script tries to sync rooms table state with what actual rooms are running.

@relbraun
Copy link

relbraun commented Jul 12, 2023

@prefiks thanks for the response. Is there any way I can be aware about such situation before the client experiences it?

And also, regarding the network problems you have mentioned, we use AWS EKS who (as we think) not likely to have such network problems. Do you know about such problems in EKS?

@relbraun
Copy link

Hi, I have this problem every 2 days in my 3 nodes cluster. Now there is a muc that returns error when I run mod_muc_admin:get_room_options on each node.
Any idea how to solve it? And moreover, how can I prevent this situation?
@prefiks please maybe you have any idea?

@relbraun
Copy link

And also, when I run this:
(fun(R,S) ->H = ejabberd_router:host_of_route(S), case mod_muc:unhibernate_room(H,S,R) of error -> no_room; {ok, P} -> {P, rpc:pinfo(P)} end end)(<<"roomid">>, <<"conference.myswerver.com">>).

I get this: {<31537.5036.1>,undefined}

@prefiks
Copy link
Member

prefiks commented Nov 30, 2023

You could try running this on each node, it possibly could help:
{code}
lists:foreach(fun({R, H, P}) ->case node(P)/=node() andalso rpc:pinfo(P, status) == undefined of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).
{code}

@relbraun
Copy link

Thanks @prefiks

But what causes this situation and how can we avoid it or at least, intercept it immediately when it occurs?

@logicwonder
Copy link
Author

@prefiks In my experience, the inconsistency between cluster nodes is bound to happen either because of network or other reasons. Could you please suggest a cursory check (an Erlang script) that can detect such inconsistencies in a cluster. This could be of great help and may be effective to support the issues raised in future.

@prefiks
Copy link
Member

prefiks commented Nov 30, 2023

This is most likely result of network issues between node clusters, probably at some node got dropped from a cluster, when other nodes couldn't reach it, and it could happen that operations execute in that time aren't properly propagated when connection get restored. Hard to tell when desync like that happens, it probably would require comparing table content of all nodes but this gets tricky since those tables can change underneath. Maybe just checking if size of all tables match can work in most cases?
{code}
ejabberd_cluster:multicall(mnesia, table_info, [muc_online_room, size]).
{code}
This should return {[SizeA, SizeB, SizeC, ...], [NotReachableNodeA, ...]}, if Size* are different, then likelly tables are desynced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants