
delete_old_mam_messages causes spike in mnesia table size #1161

Closed
mathiasertl opened this issue Jun 19, 2016 · 8 comments

@mathiasertl
Contributor

What version of ejabberd are you using?

16.04

What operating system (version) are you using?

Debian 8

How did you install ejabberd (source, package, distribution)?

jabber.at Packages, which already includes some patches from @weiss (see debian/patches/mod_mam* in https://github.com/jabber-at/ejabberd/tree/master/debian/patches).

What did not work as expected? Are there error messages in the log? What
was the unexpected behavior? What was the expected result?

We have enabled mod_mam and have cron jobs that delete old messages:

20 5    * * *   ejabberd        ejabberdctl --no-timeout delete_old_mam_messages chat 21
25 5    * * *   ejabberd        ejabberdctl --no-timeout delete_old_mam_messages groupchat 21

When starting with a fresh mod_mam table, soon after 21 days have passed (23 this time), delete_old_mam_messages chat triggers a surge in the table size. The table grows from ~120 MB to 2 GB (mnesia's table size limit) before the groupchat command even executes. This has happened to us a few times already; it not only breaks MAM, it also breaks ejabberdctl dump, thus disabling our backups.
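A quick way to watch the table grow is to query mnesia from ejabberd's Erlang shell (e.g. via `ejabberdctl debug`). This is just a monitoring sketch, not part of any patch in this thread; note that mnesia reports the `memory` of disc_only_copies tables in bytes, while RAM-backed tables report machine words:

```erlang
%% Sketch: inspect the archive_msg table from ejabberd's Erlang shell.
%% For disc_only_copies tables, `memory' is reported in bytes; for
%% ram_copies/disc_copies it is reported in machine words.
check_archive() ->
    Records = mnesia:table_info(archive_msg, size),         % record count
    Mem     = mnesia:table_info(archive_msg, memory),       % allocation
    Type    = mnesia:table_info(archive_msg, storage_type), % e.g. disc_only_copies
    io:format("archive_msg: ~b records, ~b units, ~p~n", [Records, Mem, Type]).
```

The 2 GB ceiling matches the default file size limit of dets, which backs disc_only_copies tables.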

It seems to me @beherit describes similar issues in #1065.

@mathiasertl mathiasertl changed the title delete_old_mam_messages delete_old_mam_messages causes spike in mnesia table size Jun 19, 2016
@weiss
Member

weiss commented Jun 19, 2016

So it obviously didn't help to use a transaction context for writing to the archive_msg table.

I failed to reproduce the issue by transmitting a few messages while running the delete_old_mam_messages command, so I guess it's a timing issue which is more easily triggered on a server with real load. I could try again with Tsung or so.

Either way, maybe we should just declare the Mnesia backend as unsupported for MAM. It's currently not usable in production, IMO.

@beherit

beherit commented Jun 20, 2016

@weiss I think you need to transmit a lot of messages, not only a few ;) For me the problem doesn't exist anymore, but my server doesn't have as many registered users as jabber.at.

@badlop
Member

badlop commented Jun 20, 2016

weiss, I think your previous commit improved the situation when there are many users with few messages, but it still breaks when a user has a lot of messages. Can you give this a try?

diff --git a/src/mod_mam_mnesia.erl b/src/mod_mam_mnesia.erl
index 10b98da..2bcc6fb 100644
--- a/src/mod_mam_mnesia.erl
+++ b/src/mod_mam_mnesia.erl
@@ -51,31 +51,23 @@ remove_user(LUser, LServer) ->
 remove_room(_LServer, LName, LHost) ->
     remove_user(LName, LHost).

-delete_old_messages(global, TimeStamp, Type) ->
-    delete_old_user_messages(mnesia:dirty_first(archive_msg), TimeStamp, Type).
-
-delete_old_user_messages('$end_of_table', _TimeStamp, _Type) ->
-    ok;
-delete_old_user_messages(User, TimeStamp, Type) ->
+delete_old_messages(_, TimeStamp, Type) ->
+    F2 = fun(Record, _Acc) ->
+        #archive_msg{timestamp = MsgTS, type = MsgType} = Record,
+        case MsgTS >= TimeStamp orelse (Type /= all andalso Type /= MsgType) of
+            false ->
+            mnesia:delete_object(Record);
+            true ->
+            ok
+        end
+    end,
     F = fun() ->
-       Msgs = mnesia:read(archive_msg, User),
-       Keep = lists:filter(
-            fun(#archive_msg{timestamp = MsgTS,
-                     type = MsgType}) ->
-                MsgTS >= TimeStamp orelse (Type /= all andalso
-                               Type /= MsgType)
-            end, Msgs),
-       if length(Keep) < length(Msgs) ->
-           mnesia:delete({archive_msg, User}),
-           lists:foreach(fun(Msg) -> mnesia:write(Msg) end, Keep);
-          true ->
-           ok
-       end
+       mnesia:write_lock_table(archive_msg),
+       mnesia:foldl(F2, ok, archive_msg)
    end,
     case mnesia:transaction(F) of
    {atomic, ok} ->
-       delete_old_user_messages(mnesia:dirty_next(archive_msg, User),
-                    TimeStamp, Type);
+       ok;
    {aborted, Err} ->
        ?ERROR_MSG("Cannot delete old MAM messages: ~s", [Err]),
        Err

@cromain cromain self-assigned this Jun 20, 2016
@cromain cromain added this to the ejabberd 16.06 milestone Jun 20, 2016
@cromain
Contributor

cromain commented Jun 20, 2016

The biggest issue I see here, even though MAM should use SQL anyway, is that the mnesia table is disc_only and of type bag. That is not well suited for this workload. I'll provide a small patch to perform the cleanup in memory for better performance with mnesia. There may also be further possible optimizations, such as badlop's.

@weiss
Member

weiss commented Jun 20, 2016

@badlop

> Can you give this a try?

That's basically the way I did it initially (just without the write lock). It was way slower (it took hours instead of seconds/minutes), especially for the case where few users have many messages. Maybe each user's bag is rewritten on every delete_object call, or something (I didn't check the code).

Either way, I have no idea why the table is trashed. It does feel like an issue with disc_only tables of type bag. I guess those aren't used much in practice. So I can well imagine @cromain's suggestion might work around the issue.

@cromain
Contributor

cromain commented Jun 21, 2016

I suggest this change to perform the cleanup in memory, then get the bag back to disc_only for normal operation.
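The actual commit isn't included in the thread. A rough sketch of the idea described here, assuming the flip between storage types is done with mnesia:change_table_copy_type/3 (the function name delete_old_in_memory and the inner do_delete_old_messages call are illustrative, not taken from the real patch):

```erlang
%% Illustrative sketch only: flip the disc_only_copies bag to a
%% RAM-backed copy, run the cleanup in memory, then flip it back
%% to disc_only_copies for normal operation.
delete_old_in_memory(TimeStamp, Type) ->
    {atomic, ok} = mnesia:change_table_copy_type(archive_msg, node(),
                                                 disc_copies),
    Result = do_delete_old_messages(TimeStamp, Type), % hypothetical cleanup
    {atomic, ok} = mnesia:change_table_copy_type(archive_msg, node(),
                                                 disc_only_copies),
    Result.
```

This avoids the slow dets rewrites on every delete, since the deletions hit an ets-backed copy instead; the trade-off is that the whole table must fit in RAM for the duration of the cleanup.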

@cromain cromain closed this as completed Jun 21, 2016
@weiss
Member

weiss commented Jun 23, 2016

@badlop

> Can you give this a try?

> That's basically the way I did it initially (just without the write lock). It was way slower (it took hours instead of seconds/minutes), especially for the case where few users have many messages. Maybe each user's bag is rewritten on every delete_object call, or something (I didn't check the code).

The situation may well be different now that we're deleting old messages in memory, of course.

@lock

lock bot commented Jun 11, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 11, 2019