Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in process_request eval block: Child server process timed out! #23

Closed
Synchro opened this issue Jan 29, 2019 · 12 comments
Closed

Comments

@Synchro
Copy link

Synchro commented Jan 29, 2019

I'm having trouble with my mail server (built with https://mailinabox.email v0.4 and using spampd version 2.42) timing out when it gets a large volume of messages (not that much - hundreds per hour), the server has load of < 0.1, 99% idle, but is showing this in logs:

Jan 29 19:02:06 mail2 spampd[27504]: WARNING!! Error in process_request eval block: Child server process timed out!

It is sending delay messages like this:

This is the mail system at host mail.example.com.

####################################################################
# THIS IS A WARNING ONLY.  YOU DO NOT NEED TO RESEND YOUR MESSAGE. #
####################################################################

Your message could not be delivered for more than 3 hour(s).
It will be retried until it is 2 day(s) old.

For further assistance, please send mail to postmaster.

If you do so, please include this problem report. You can
delete your own text from the attached returned message.

                  The mail system

<andrew@example.net>: conversation with 127.0.0.1[127.0.0.1] timed out
   while receiving the initial server greeting
Reporting-MTA: dns; mail.example.com
X-Postfix-Queue-ID: 54D7761235
X-Postfix-Sender: rfc822; info@example.com
Arrival-Date: Tue, 29 Jan 2019 13:45:07 +0000 (UTC)

Final-Recipient: rfc822; user@example.net
Original-Recipient: rfc822;user@example.net
Action: delayed
Status: 4.4.2
Diagnostic-Code: X-Postfix; conversation with 127.0.0.1[127.0.0.1] timed out
   while receiving the initial server greeting

It's notable that this is an internal message, being sent from one user to another within the same mail server, so no external traffic is involved. I found this report which appears to be the same problem from many years ago. I really don't know where to look for what might be causing this, but it appears to be inside spampd. Any ideas?

@MasterPCUK
Copy link

MasterPCUK commented Feb 13, 2019

+1
I have the same issue multiple times per day, and my mail server processes only about 50 emails per hour. Load is 0.00 0.03 for most part.

Feb 13 14:24:04 box spampd[20325]: WARNING!! Error in process_request eval block: Child server process timed out!

As far as I can see it is only happening with when it scans emails with a rather large attachments. ( ~6MB pdf )

Ubuntu Server 18.04

@mpaperno
Copy link
Owner

We probably need more info to find out where it is actually hanging. That timeout message should be coming only after the --childtimeout setting time which is 6 minutes by default. Within that time is also where the mail is actually run through SpamAssassin, which has its own timeout of --satimeout and by default is 285s (15s less than the Postfix default 5 minute smtp_data_done_timeout).

If any of this can't complete before the destination server gives up listening (smtp_data_done_timeout et. al. in the case of Postfix), it will close the connection and you should see a different error in the logs. But if the destination has longer timeouts, it could be that SpamPD is timing out first.

After the quoted error about the timeout, do either of you see any other spampd errors, perhaps about closing sockets? What about anything from the Postfix servers (they must log some kind of timeout also?).

Now in theory if it times out during the SA checks (quite possible if there are network checks or whatnot), you should be seeing a slightly different timeout message. But the processes are nested and the timeouts are all pretty close to each other, so it's possible SA is timing out but it's not obvious.

You could try disabling all SA network checks and see if that helps. If you have external plugins working with SA then maybe disable those as well. If the mail volume is low enough and this happens frequently, enabling debug logging might prove to be useful.

conversation with 127.0.0.1[127.0.0.1] timed out
while receiving the initial server greeting

I don't know why it would timeout during the HELO phase, if that's what this is saying. Only thing that comes to mind is if the destination is trying to do reverse lookup or some other kind of long-running task whenever SpamPD connects to it. Remember SpamPD is just a proxy, especially during HELO it's just relaying the server messages, so if either of the servers hangs up then it will look like an SpamPD timeout.

As far as I can see it is only happening with when it scans emails with a rather large attachments. ( ~6MB pdf )

Seems like a good clue that perhaps SA is running slowly (or some other resource contention, disk or whatever). What do you have set for --maxsize? Default is 64KB.

@Synchro
Copy link
Author

Synchro commented Feb 13, 2019

I don't think this is to do with anything remote because I'm still seeing this while sending between accounts on the same server, for example when webmaster@ forwards to real mailbox names in the same domain. That should rule out DNS. I also added the source address to mynetworks in postfix, which I think should exempt messages from SA processing anyway, but that's not helping.

I'm also not seeing any correlation with large messages or attachments. The things that are being held up most often (though probably only because our server gets lots of them) are simple notifications, no more than a few kb for the entire message.

What I have noticed a correlation with is high message counts - these emails are often outputs from dev environments with lots of monitoring, so message counts get large, and the queue expands - however, as far as I can see the queue only expands because of this hang-up, and I can't think of anything to stop them being delivered more or less instantly since they should be exempted from SA, the server is doing nothing else, and has capacity to burn.

One other clue is that whatever has changed, it happened with the 0.40 release - our inbound traffic pattern has not changed, but this problem only started after we migrated to this release. Of course we also had to switch server and OS, so there may be many other factors at work, which doesn't help.

@mpaperno
Copy link
Owner

I also added the source address to mynetworks in postfix, which I think should exempt messages from SA processing anyway, but that's not helping.

If it's still routing through SpamPD then it's not being excluded. Once it's "in" SpamPD then it is getting run through SA regardless of any Postfix settings.

It should save a lot of resources if you can exclude all those small internal notices from being scanned.

it happened with the 0.40 release

Coming from which version? I don't think anything fundamental has changed in a long time, but might be worth a look.

@Synchro
Copy link
Author

Synchro commented Feb 13, 2019

Our notifications tend to come in waves, leading to long queues. Here are some grabs from munin, all from the same time period. First the mail queue, showing a typical deferred backlog - it's not very big (250 messages), but takes the entire day to clear:

postfix_mailqueue-day

At the same time, the server is doing nothing:
diskstats_utilization-day
cpu-day
memory-day
Network usage mirrors normal user activity, no peaks that correlate with the queue:
fw_packets-day

It should save a lot of resources if you can exclude all those small internal notices from being scanned.

I could understand that helping if the server was maxed out and thrashing, but it's doing very little. This makes me think that it has to be an internal breakage/misconfig, not a resource problem.

Any ideas how to prevent passing messages through SA, either altogether (receiving some spam is better than these delays!), or via an allowed source list, given that the postfix tweak doesn't seem to work?

We upgraded via the approved route, from 0.30 - apart from this issue it all went very smoothly.

@mpaperno
Copy link
Owner

Yea that delay looks like something systematic, not basic resources. Something is throttling delivery, perhaps artificially (eg. greylisting).

At this point you really need to narrow down the source of the delay, whether it really has anything to do with SpamPD and/or SA. Turn on --debug in spampd. Where does the process get to before it hangs? Is SA even being invoked yet, or is it just at HELO/greeting stage? If the latter then I'm guessing the delay has nothing to do with SPD or SA at all.

I don't know anything about MiaB or what the setup is (except the quick look I had just now... never knew they used SPD). So I can't give you any advice there. I see they use Dovecot LDA on the other side of SpamPD, which I also know nothing about. I also don't see from their diagram how outgoing or internal mail is being scanned (via which path). Seems like there would be instructions somewhere for disabling scanning of internal/outgoing mail in their setup, as that is a pretty common requirement (a waste of time if all internal mail is trusted anyway).

Seems like a lot of users... is there a forum or something you could post in? Are there other reports of this? Their GH repo site has too many open issues for me to scan through but a search on "spampd" doesn't bring up anything relevant.

@MasterPCUK Are you also using Mail in a Box?

@Synchro
Copy link
Author

Synchro commented Feb 15, 2019

Apologies - I was getting mixed up between my various issues - I should have posted most of what I wrote above in the mailinabox tracker rather than here, but your input is much appreciated!

I think I have managed to enable debug logging in spampd, so I'll wait and see if it logs anything interesting.

@MasterPCUK
Copy link

@mpaperno Yes I am also using Mail In A Box.
I will also enable --debug to see if that can point me in the right direction.

Also, it does seem to happen with any size of emails, so my assumption that it is only happening with emails over 6mb is false.

@Synchro
Copy link
Author

Synchro commented Feb 15, 2019

OK, I have debug logging that covers this error:

Feb 15 13:12:54 mail postfix/lmtp[15534]: 05C8D6120B: to=<andrew@example.net>, orig_to=<webmaster@example.org >, relay=127.0.0.1[127.0.0.1]:10025, delay=7623, delays=7016/307/300/0, dsn=4.4.2, status=deferred (conversation with 127.0.0.1[127.0.0.1] timed out while receiving the initial server greeting)
Feb 15 13:12:54 mail postfix/lmtp[15534]: 05C8D6120B: to=<marcus@example.net>, orig_to=<webmaster@example.org>, relay=127.0.0.1[127.0.0.1]:10025, delay=7623, delays=7016/307/300/0, dsn=4.4.2, status=deferred (conversation with 127.0.0.1[127.0.0.1] timed out while receiving the initial server greeting)
Feb 15 13:13:07 mail spampd[14943]: WARNING!! Error in process_request eval block: Child server process timed out!
Feb 15 13:13:08 mail spampd[24831]: Starting "1" children
Feb 15 13:13:08 mail spampd[16918]: Child Preforked (16918)
Feb 15 13:13:08 mail spampd[16918]: 2019/02/15-13:13:08 CONNECT TCP Peer: "[127.0.0.1]:49390" Local: "[127.0.0.1]:10025"
Feb 15 13:13:08 mail spampd[16918]: Initiated Server
Feb 15 13:13:08 mail postfix/lmtp[15539]: 205D76121E: to=<marcus@example.net>, orig_to=<errors@example.com>, relay=127.0.0.1[127.0.0.1]:10025, delay=6570, delays=5950/2.1/252/366, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending end of data -- message may be sent more than once)

Then there's this, which looks odd - spampd spawns some child processes, then kills them?

Feb 15 13:13:08 mail spampd[16918]: Initiated Client
Feb 15 13:13:08 mail spampd[16918]: smtp_server state: 'started'
Feb 15 13:13:08 mail spampd[16918]: Closed connections
Feb 15 13:13:08 mail spampd[16918]: 2019/02/15-13:13:08 CONNECT TCP Peer: "[127.0.0.1]:49402" Local: "[127.0.0.1]:10025"
Feb 15 13:13:08 mail spampd[16918]: Initiated Server
Feb 15 13:13:08 mail spampd[16918]: Initiated Client
Feb 15 13:13:08 mail spampd[16918]: smtp_server state: 'started'
Feb 15 13:13:08 mail spampd[16918]: Closed connections
Feb 15 13:13:08 mail spampd[16918]: 2019/02/15-13:13:08 CONNECT TCP Peer: "[127.0.0.1]:49406" Local: "[127.0.0.1]:10025"
Feb 15 13:13:08 mail spampd[16918]: Initiated Server
Feb 15 13:13:08 mail spampd[16918]: Initiated Client
Feb 15 13:13:08 mail spampd[16918]: smtp_server state: 'started'
Feb 15 13:13:08 mail spampd[16918]: Closed connections
Feb 15 13:13:08 mail spampd[16918]: 2019/02/15-13:13:08 CONNECT TCP Peer: "[127.0.0.1]:49418" Local: "[127.0.0.1]:10025"
Feb 15 13:13:08 mail spampd[16918]: Initiated Server
Feb 15 13:13:08 mail spampd[16918]: Initiated Client
Feb 15 13:13:08 mail spampd[16918]: smtp_server state: 'started'
Feb 15 13:13:08 mail spampd[16918]: Closed connections

@mpaperno
Copy link
Owner

Interesting... but need more of the first type of log. Would be good to see the whole lifetime of a spampd child (the ID in spampd[xxxx], starting with Starting "1" Children), and also the lifetime of a delivery attempt (eg. from where it starts to send 205D76121E until it times out at 13:13:08).

This only tells us that a child died due to timeout and was respawned (which we already knew). The sending server gives up at this point because the socket was closed when the original child died. Where it shows a new connection after the "Child Preforked" message, I think that's a new connection entirely, which then disconnects a little bit later (for whatever reason).

Interesting that this one died apparently during data transfer, vs. the others say during initial greeting. Would perhaps be useful to see both types logged.

As for the 2nd chunk... I have no idea. Looks like something is connecting to spampd/destination server and then disconnecting right away. spampd is spawning server/client pairs there (for the relay), not new children. But each child has a limit to how many connections it processes before it is killed and respawned (--maxrequests setting, default 20). If this connect/disconnect happens a lot, might be good to increase that limit (the default is probably from 2003 :) ).

@Synchro
Copy link
Author

Synchro commented Feb 15, 2019

Here are all the log entries relating to that process ID:

Feb 15 12:55:57 mail spampd[14943]: Child Preforked (14943)
Feb 15 13:01:58 mail spampd[14943]: 2019/02/15-13:01:58 CONNECT TCP Peer: "[127.0.0.1]:49182" Local: "[127.0.0.1]:10025"
Feb 15 13:01:58 mail spampd[14943]: Initiated Server
Feb 15 13:01:58 mail spampd[14943]: Initiated Client
Feb 15 13:01:58 mail spampd[14943]: smtp_server state: 'started'
Feb 15 13:01:58 mail spampd[14943]: smtp_server state: 'LHLO mail.example.com'
Feb 15 13:01:58 mail spampd[14943]: Destination response: '250-mail.example.com#015#012250-STARTTLS#015#012250-8BITMIME#015#012250-ENHANCEDSTATUSCODES#015#012250 PIPELINING'
Feb 15 13:01:58 mail spampd[14943]: smtp_server state: 'MAIL FROM:<www-data@example.net>'
Feb 15 13:01:58 mail spampd[14943]: Destination response: '250 2.1.0 OK'
Feb 15 13:01:58 mail spampd[14943]: smtp_server state: 'RCPT TO:<andrew@example.com>'
Feb 15 13:01:58 mail spampd[14943]: Destination response: '250 2.1.5 OK'
Feb 15 13:01:58 mail spampd[14943]: smtp_server state: 'RCPT TO:<marcus@example.com>'
Feb 15 13:01:58 mail spampd[14943]: Destination response: '250 2.1.5 OK'
Feb 15 13:01:58 mail spampd[14943]: smtp_server state: 'DATA'
Feb 15 13:01:58 mail spampd[14943]: Destination response: '354 OK'
Feb 15 13:01:58 mail spampd[14943]: smtp_server state: '.'
Feb 15 13:01:58 mail spampd[14943]: processing message <20190215130157.82D8683078@example.net> for <andrew@example.com>,<marcus@example.com>
Feb 15 13:02:02 mail spampd[14943]: Returned from checking by SpamAssassin
Feb 15 13:02:02 mail spampd[14943]: Rewriting mail using SpamAssassin
Feb 15 13:02:02 mail spampd[14943]: clean message <20190215130157.82D8683078@example.net> (-1.90/5.00) from <www-data@example.net> for <andrew@example.com>,<marcus@example.com> in 3.23s, 32186 bytes.
Feb 15 13:02:02 mail spampd[14943]: Finished sending DATA
Feb 15 13:02:02 mail spampd[14943]: Destination response: '250 2.0.0 <andrew@example.com> cEFuK0a4ZlyOPAAAcuvPBw Saved'
Feb 15 13:02:02 mail spampd[14943]: smtp_server state: 'QUIT'
Feb 15 13:02:02 mail spampd[14943]: Destination response: '250 2.0.0 <marcus@example.com> cEFuK0a4ZlyOPAAAcuvPBw:2 Saved'
Feb 15 13:07:02 mail spampd[14943]: Closed connections
Feb 15 13:07:02 mail spampd[14943]: 2019/02/15-13:07:02 CONNECT TCP Peer: "[127.0.0.1]:49238" Local: "[127.0.0.1]:10025"
Feb 15 13:07:02 mail spampd[14943]: Initiated Server
Feb 15 13:07:02 mail spampd[14943]: Initiated Client
Feb 15 13:07:02 mail spampd[14943]: smtp_server state: 'started'
Feb 15 13:07:02 mail spampd[14943]: smtp_server state: 'LHLO mail.example.com'
Feb 15 13:07:02 mail spampd[14943]: Destination response: '250-mail.example.com#015#012250-STARTTLS#015#012250-8BITMIME#015#012250-ENHANCEDSTATUSCODES#015#012250 PIPELINING'
Feb 15 13:07:02 mail spampd[14943]: smtp_server state: 'MAIL FROM:<bounces@mail.example.org> BODY=8BITMIME'
Feb 15 13:07:02 mail spampd[14943]: Destination response: '250 2.1.0 OK'
Feb 15 13:07:02 mail spampd[14943]: smtp_server state: 'RCPT TO:<andrew@example.com>'
Feb 15 13:07:02 mail spampd[14943]: Destination response: '250 2.1.5 OK'
Feb 15 13:07:02 mail spampd[14943]: smtp_server state: 'RCPT TO:<marcus@example.com>'
Feb 15 13:07:02 mail spampd[14943]: Destination response: '250 2.1.5 OK'
Feb 15 13:07:02 mail spampd[14943]: smtp_server state: 'DATA'
Feb 15 13:07:02 mail spampd[14943]: Destination response: '354 OK'
Feb 15 13:07:02 mail spampd[14943]: smtp_server state: '.'
Feb 15 13:07:02 mail spampd[14943]: processing message <StnIltQb5r0SdfBvbmB31rBT53Co3WJ1Qs0ovlKQq8@smm.im> for <andrew@example.com>,<marcus@example.com>
Feb 15 13:07:07 mail spampd[14943]: Returned from checking by SpamAssassin
Feb 15 13:07:07 mail spampd[14943]: Rewriting mail using SpamAssassin
Feb 15 13:07:07 mail spampd[14943]: clean message <StnIltQb5r0SdfBvbmB31rBT53Co3WJ1Qs0ovlKQq8@smm.im> (-1.90/5.00) from <bounces@mail.example.org> for <andrew@example.com>,<marcus@example.com> in 5.64s, 1226 bytes.
Feb 15 13:07:07 mail spampd[14943]: Finished sending DATA
Feb 15 13:07:07 mail spampd[14943]: Destination response: '250 2.0.0 <andrew@example.com> 0G6ADna5ZlziPgAAcuvPBw Saved'
Feb 15 13:13:07 mail spampd[14943]: WARNING!! Error in process_request eval block: Child server process timed out!

That last line shows a large time offset, supporting the timeout message.

Here are the entries relating to the transaction ID:

Feb 15 11:23:38 mail postfix/smtpd[4516]: 205D76121E: client=mail4.example.net[93.93.128.140]
Feb 15 11:23:38 mail postfix/cleanup[5026]: 205D76121E: message-id=<StnIltQb5r0SdfBvbmB31rBT53Co3WJ1Qs0ovlKQq8@smm.im>
Feb 15 11:23:38 mail opendmarc[13613]: 205D76121E: example.net fail
Feb 15 11:23:38 mail postfix/qmgr[23309]: 205D76121E: from=<bounces@mail.example.net>, size=1151, nrcpt=2 (queue active)
Feb 15 11:30:17 mail postfix/error[5995]: 205D76121E: to=<andrew@example.com>, orig_to=<errors@example.net>, relay=none, delay=399, delays=0.17/399/0/0.02, dsn=4.4.2, status=deferred (delivery temporarily suspended: lost connection with 127.0.0.1[127.0.0.1] while sending end of data -- message may be sent more than once)
Feb 15 11:30:17 mail postfix/error[5995]: 205D76121E: to=<marcus@example.com>, orig_to=<errors@example.net>, relay=none, delay=399, delays=0.17/399/0/0.03, dsn=4.4.2, status=deferred (delivery temporarily suspended: lost connection with 127.0.0.1[127.0.0.1] while sending end of data -- message may be sent more than once)
Feb 15 11:37:47 mail postfix/qmgr[23309]: 205D76121E: from=<bounces@mail.example.net>, size=1151, nrcpt=2 (queue active)
Feb 15 11:45:57 mail postfix/error[7554]: 205D76121E: to=<andrew@example.com>, orig_to=<errors@example.net>, relay=none, delay=1339, delays=849/490/0/0.01, dsn=4.4.2, status=deferred (delivery temporarily suspended: conversation with 127.0.0.1[127.0.0.1] timed out while receiving the initial server greeting)
Feb 15 11:45:57 mail postfix/error[7554]: 205D76121E: to=<marcus@example.com>, orig_to=<errors@example.net>, relay=none, delay=1339, delays=849/490/0/0.02, dsn=4.4.2, status=deferred (delivery temporarily suspended: conversation with 127.0.0.1[127.0.0.1] timed out while receiving the initial server greeting)
Feb 15 12:12:47 mail postfix/qmgr[23309]: 205D76121E: from=<bounces@mail.example.net>, size=1151, nrcpt=2 (queue active)
Feb 15 12:12:47 mail postfix/error[10331]: 205D76121E: to=<andrew@example.com>, orig_to=<errors@example.net>, relay=none, delay=2949, delays=2949/0.02/0/0.01, dsn=4.4.2, status=deferred (delivery temporarily suspended: lost connection with 127.0.0.1[127.0.0.1] while sending end of data -- message may be sent more than once)
Feb 15 12:12:47 mail postfix/error[10331]: 205D76121E: to=<marcus@example.com>, orig_to=<errors@example.net>, relay=none, delay=2949, delays=2949/0.02/0/0.02, dsn=4.4.2, status=deferred (delivery temporarily suspended: lost connection with 127.0.0.1[127.0.0.1] while sending end of data -- message may be sent more than once)
Feb 15 13:02:47 mail postfix/qmgr[23309]: 205D76121E: from=<bounces@mail.example.net>, size=1151, nrcpt=2 (queue active)
Feb 15 13:07:07 mail postfix/lmtp[15539]: 205D76121E: to=<andrew@example.com>, orig_to=<errors@example.net>, relay=127.0.0.1[127.0.0.1]:10025, delay=6210, delays=5950/2.1/252/5.7, dsn=2.0.0, status=sent (250 2.0.0 <andrew@example.com> 0G6ADna5ZlziPgAAcuvPBw Saved)
Feb 15 13:13:08 mail postfix/lmtp[15539]: 205D76121E: to=<marcus@example.com>, orig_to=<errors@example.net>, relay=127.0.0.1[127.0.0.1]:10025, delay=6570, delays=5950/2.1/252/366, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending end of data -- message may be sent more than once)

@mpaperno
Copy link
Owner

👍

The child process log shows a success and then the failure. In this case spampd is waiting for the destination to ack both recipients (code) with a 250 response. You can see it happen in the first transaction, but on the 2nd one for some reason the destination never acks the 2nd recipient.

And yea the 2nd part also shows how at the end one recipient got through and the other didn't.

This doesn't explain any "timed out while receiving the initial server greeting" errors, but I'm guessing it's something similar. Nor does it explain why the destination (Dovecot?) appears to be hanging, of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants