
delete_old_mam_messages causes spike in mnesia table size #1161

Closed
mathiasertl opened this issue Jun 19, 2016 · 8 comments

@mathiasertl
Contributor

What version of ejabberd are you using?

16.04

What operating system (version) are you using?

Debian 8

How did you install ejabberd (source, package, distribution)?

jabber.at Packages, which already includes some patches from @weiss (see debian/patches/mod_mam* in https://github.com/jabber-at/ejabberd/tree/master/debian/patches).

What did not work as expected? Are there error messages in the log? What
was the unexpected behavior? What was the expected result?

We have enabled mod_mam and have cron jobs that delete old messages:

20 5    * * *   ejabberd        ejabberdctl --no-timeout delete_old_mam_messages chat 21
25 5    * * *   ejabberd        ejabberdctl --no-timeout delete_old_mam_messages groupchat 21

When starting with a fresh mod_mam table, soon after 21 days have passed (23 this time), delete_old_mam_messages chat triggers a surge in the table size. The table grows from ~120 MB to 2 GB (mnesia's table size limit) before the groupchat command even executes. This has happened to us a few times already; it not only breaks MAM, it also breaks ejabberdctl dump, thus disabling our backups.
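A quick way to watch the table grow is to query mnesia from ejabberd's Erlang shell (e.g. via `ejabberdctl debug`). This is just a monitoring sketch, not part of any patch in this thread; note that mnesia reports the `memory` of disc_only_copies tables in bytes, while RAM-backed tables report machine words:

```erlang
%% Sketch: inspect the archive_msg table from ejabberd's Erlang shell.
%% For disc_only_copies tables, `memory' is reported in bytes; for
%% ram_copies/disc_copies it is reported in machine words.
check_archive() ->
    Records = mnesia:table_info(archive_msg, size),         % record count
    Mem     = mnesia:table_info(archive_msg, memory),       % allocation
    Type    = mnesia:table_info(archive_msg, storage_type), % e.g. disc_only_copies
    io:format("archive_msg: ~b records, ~b units, ~p~n", [Records, Mem, Type]).
```

The 2 GB ceiling matches the default file size limit of dets, which backs disc_only_copies tables.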

It seems to me @beherit describes similar issues in #1065.

@mathiasertl mathiasertl changed the title delete_old_mam_messages delete_old_mam_messages causes spike in mnesia table size Jun 19, 2016
@weiss
Member

weiss commented Jun 19, 2016

So it obviously didn't help to use a transaction context for writing to the archive_msg table.

I failed to reproduce the issue by transmitting a few messages while running the delete_old_mam_messages command, so I guess it's a timing issue which is more easily triggered on a server with real load. I could try again with Tsung or so.

Either way, maybe we should just declare the Mnesia backend as unsupported for MAM. It's currently not usable in production, IMO.

@beherit

beherit commented Jun 20, 2016

@weiss I think you need to transmit a lot of messages, not only a few ;) For me the problem doesn't exist anymore, but my server doesn't have as many registered users as jabber.at.

@badlop
Member

badlop commented Jun 20, 2016

weiss, I think your previous commit improved the situation when there are many users with few messages, but it still breaks when a user has a lot of messages. Can you give this a try?

diff --git a/src/mod_mam_mnesia.erl b/src/mod_mam_mnesia.erl
index 10b98da..2bcc6fb 100644
--- a/src/mod_mam_mnesia.erl
+++ b/src/mod_mam_mnesia.erl
@@ -51,31 +51,23 @@ remove_user(LUser, LServer) ->
 remove_room(_LServer, LName, LHost) ->
     remove_user(LName, LHost).

-delete_old_messages(global, TimeStamp, Type) ->
-    delete_old_user_messages(mnesia:dirty_first(archive_msg), TimeStamp, Type).
-
-delete_old_user_messages('$end_of_table', _TimeStamp, _Type) ->
-    ok;
-delete_old_user_messages(User, TimeStamp, Type) ->
+delete_old_messages(_, TimeStamp, Type) ->
+    F2 = fun(Record, _Acc) ->
+        #archive_msg{timestamp = MsgTS, type = MsgType} = Record,
+        case MsgTS >= TimeStamp orelse (Type /= all andalso Type /= MsgType) of
+            false ->
+            mnesia:delete_object(Record);
+            true ->
+            ok
+        end
+    end,
     F = fun() ->
-       Msgs = mnesia:read(archive_msg, User),
-       Keep = lists:filter(
-            fun(#archive_msg{timestamp = MsgTS,
-                     type = MsgType}) ->
-                MsgTS >= TimeStamp orelse (Type /= all andalso
-                               Type /= MsgType)
-            end, Msgs),
-       if length(Keep) < length(Msgs) ->
-           mnesia:delete({archive_msg, User}),
-           lists:foreach(fun(Msg) -> mnesia:write(Msg) end, Keep);
-          true ->
-           ok
-       end
+       mnesia:write_lock_table(archive_msg),
+       mnesia:foldl(F2, ok, archive_msg)
    end,
     case mnesia:transaction(F) of
    {atomic, ok} ->
-       delete_old_user_messages(mnesia:dirty_next(archive_msg, User),
-                    TimeStamp, Type);
+       ok;
    {aborted, Err} ->
        ?ERROR_MSG("Cannot delete old MAM messages: ~s", [Err]),
        Err

@cromain cromain self-assigned this Jun 20, 2016
@cromain cromain added this to the ejabberd 16.06 milestone Jun 20, 2016
@cromain
Contributor

cromain commented Jun 20, 2016

The biggest issue I see here, even though MAM should use SQL anyway, is that the mnesia table is disc_only and of type bag. That is not well suited for this workload. I'll provide a small patch to perform the cleanup in memory for better performance with mnesia. There may also be further possible optimizations, such as badlop's.

@weiss
Member

weiss commented Jun 20, 2016

@badlop

> Can you give this a try?

That's basically the way I did it initially (just without the write lock). It was way slower (it took hours instead of seconds/minutes), especially for the case where few users have many messages. Maybe each user's bag is rewritten on every delete_object call, or something (I didn't check the code).

Either way, I have no idea why the table is trashed. It does feel like an issue with disc_only tables of type bag. I guess those aren't used much in practice. So I can well imagine @cromain's suggestion might work around the issue.

@cromain
Contributor

cromain commented Jun 21, 2016

I suggest this change to perform the cleanup in memory, then get the bag back to disc_only for normal operation.
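The actual commit isn't included in the thread. A rough sketch of the idea described here, assuming the flip between storage types is done with mnesia:change_table_copy_type/3 (the function name delete_old_in_memory and the inner do_delete_old_messages call are illustrative, not taken from the real patch):

```erlang
%% Illustrative sketch only: flip the disc_only_copies bag to a
%% RAM-backed copy, run the cleanup in memory, then flip it back
%% to disc_only_copies for normal operation.
delete_old_in_memory(TimeStamp, Type) ->
    {atomic, ok} = mnesia:change_table_copy_type(archive_msg, node(),
                                                 disc_copies),
    Result = do_delete_old_messages(TimeStamp, Type), % hypothetical cleanup
    {atomic, ok} = mnesia:change_table_copy_type(archive_msg, node(),
                                                 disc_only_copies),
    Result.
```

This avoids the slow dets rewrites on every delete, since the deletions hit an ets-backed copy instead; the trade-off is that the whole table must fit in RAM for the duration of the cleanup.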

@cromain cromain closed this as completed Jun 21, 2016
@weiss
Member

weiss commented Jun 23, 2016

@badlop

> Can you give this a try?

> That's basically the way I did it initially (just without the write lock). It was way slower (it took hours instead of seconds/minutes), especially for the case where few users have many messages. Maybe each user's bag is rewritten on every delete_object call, or something (I didn't check the code).

The situation may well be different now that we're deleting old messages in memory, of course.

@lock

lock bot commented Jun 11, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 11, 2019