If a child_fetch child (or any other child reaped by wait_children() in fdm.c) exits non-zero, its exit status is lost. The return (retcode) at fdm.c:262 is unreachable: the for (;;) loop can only be left through the two return (0) statements on lines 206 and 209 or the fatal() on line 210. (The fatal() is not the problem; the two returns are.) The fix is to change lines 206 and 209 to return (retcode). retcode starts at 0 and, for the duration of the wait_children() call, only ever changes to non-zero, never back to 0.
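The control flow described above can be sketched as follows. This is a minimal Python analogue of the loop with the fix applied, not fdm's actual C code; the spawn helper is hypothetical, and collapsing every abnormal exit to a return value of 1 is an assumption made for illustration.

```python
import os


def spawn(exit_code: int) -> int:
    """Hypothetical helper: fork a child that exits at once with exit_code."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit without running parent cleanup
    return pid


def wait_children(no_hang: bool = False) -> int:
    """Reap children; return 1 if any reaped child failed, else 0."""
    flags = os.WNOHANG if no_hang else 0
    retcode = 0
    while True:
        try:
            pid, status = os.waitpid(-1, flags)
        except ChildProcessError:
            # ECHILD: nothing left to reap. Returning 0 here would discard
            # any failure already recorded in retcode.
            return retcode
        if pid == 0:
            # WNOHANG was set and no child has exited yet.
            return retcode
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            retcode = 1  # sticky: never reset to 0 within this call
```

With one child exiting 0 and another exiting 3, wait_children() returns 1. If both exit paths returned a literal 0 instead, it would return 0 no matter what the children did.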
Losing a child's error can make a failed delivery look successful, especially when fdm is used as a local delivery agent (LDA).
I noticed this because I recently switched from procmail to fdm and saw a large message get accepted for delivery but never delivered. The MTA logs showed no delivery deferral, and mail.err showed lines like this:
Apr 5 00:00:24 mymachine fdm[24074]: stdin: failed to resize mail: Cannot allocate memory
Apr 5 00:00:24 mymachine fdm[24074]: stdin: fetching error. aborted
(Or similar; in reproduction sometimes it's the deliver-rewrite.c variant of that message instead of the fetch-stdin.c variant cited above, depending upon config.)
Because the MTA considered the message delivered, the mail was lost whenever an error was hit at this stage.
In my case the error was caused by a very low memory soft limit, but that shouldn't matter: if a message can't be delivered, the LDA must exit non-zero.
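From the MTA's side, the contract looks roughly like the sketch below. It is only an illustration: hand_to_lda is a hypothetical name, the command stands for whatever LDA the MTA is configured to run, and real MTAs distinguish more outcomes than the two shown here.

```python
import subprocess
import sys


def hand_to_lda(command: list, message: bytes) -> bool:
    """Pipe message to command on stdin; return True only if it exits 0."""
    result = subprocess.run(command, input=message)
    # The exit status is the only signal the caller gets. If the LDA exits 0
    # after a failed delivery, the caller treats the message as delivered
    # and stops trying.
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in for a failing LDA: a child process that exits 1.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print("delivered" if hand_to_lda(failing, b"test") else "deferred")
```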
To reproduce:
% cat genmessage
from email.message import EmailMessage
msg = EmailMessage()
msg['Subject'] = 'Test'
msg['From'] = 'me@example.com'
msg['To'] = 'me@example.com'
# Body: 40000 lines of 100 'a' characters each.
msg.set_content('\n'.join(['a' * 100 for i in range(40000)]))
print(msg)
% python3 genmessage | prlimit --stack=$((2**24)) --data=$((2**24)) --as=$((2**24)) /usr/bin/fdm -vvv -h fetch; echo $?
Running this shows that even though there's an error it exits 0:
...
parent: got message type 2, id 0 from child 3778
parent: sending exit message to child 3778
parent: 2 children, 0 dead children
stdin: fetch loop start
stdin: got message type 2, id 1
stdin: trying (deliver) message 1
stdin: deliver message 1, error
stdin: fetching error. aborted
stdin: 6 file descriptors in use
stdin: 3778: allocated=4526358, freed=4526102, difference=256, peak=4227654
stdin: 3778: mallocs=10420, reallocs=8, frees=10419
stdin: 3778: 1, deliver_rewrite_deliver+0x172: [0x1d0bc08 256: aaaaaaaa]
stdin: 3778: 1 unfreed blocks
stdin: 0 messages processed in 0.542 seconds
stdin: polled for 0.194 seconds (0.194 blocked)
stdin: finished processing. exiting
stdin: sending exit message to parent
stdin: waiting for exit message from parent
parent: got message type 1, id 0 from child 3774
parent: sending exit message to child 3774
parent: 2 children, 0 dead children
parent: child 3778 socket error
parent: 2 children, 0 dead children
stdin: 3 file descriptors in use
stdin: 3774: allocated=284468, freed=284468, difference=0, peak=280964
stdin: 3774: mallocs=92, reallocs=6, frees=92
stdin: 3774: 0 unfreed blocks
parent: waiting for children
parent: child 3774 returned 1
parent: child 3774 died: killing 3778
parent: waiting for children
parent: 1 children, 1 dead children
parent: waiting for children
parent: child 3778 returned 0
parent: waiting for children
parent: waiting for children
parent: finished, total time 0.547 seconds
parent: 3 file descriptors in use
parent: 3773: allocated=104329, freed=104329, difference=0, peak=80876
parent: 3773: mallocs=532, reallocs=64, frees=532
parent: 3773: 0 unfreed blocks
0
With this patch:
diff --git a/fdm.c b/fdm.c
index a35393a..c09edf9 100644
--- a/fdm.c
+++ b/fdm.c
@@ -203,10 +203,10 @@ wait_children(
/* Wait for a child. */
switch (pid = waitpid(WAIT_ANY, &status, flags)) {
case 0:
- return (0);
+ return (retcode);
case -1:
if (errno == ECHILD)
- return (0);
+ return (retcode);
fatal("waitpid failed");
}
@@ -947,5 +947,6 @@ out:
free_strings(&conf.excl);
#endif
+ log_debug2("parent: exit code %d", res);
exit(res);
}
It exits 1:
...
parent: got message type 2, id 0 from child 3913
parent: sending exit message to child 3913
parent: 2 children, 0 dead children
stdin: fetch loop start
stdin: got message type 2, id 1
stdin: trying (deliver) message 1
stdin: deliver message 1, error
stdin: fetching error. aborted
stdin: 0 messages processed in 0.606 seconds
stdin: finished processing. exiting
stdin: sending exit message to parent
stdin: waiting for exit message from parent
parent: got message type 1, id 0 from child 3912
parent: sending exit message to child 3912
parent: 2 children, 0 dead children
parent: waiting for children
parent: child 3912 returned 1
parent: child 3912 died: killing 3913
parent: waiting for children
parent: 1 children, 1 dead children
parent: 1 children, 1 dead children
parent: child 3913 socket error
parent: 1 children, 1 dead children
parent: waiting for children
parent: child 3913 returned 1
parent: waiting for children
parent: waiting for children
parent: finished, total time 0.610 seconds
parent: exit code 1
1
With this patch in place and fdm used as an LDA, the large mail that was previously dropped is deferred instead, which is the desired behavior.
This is on Linux i686, but I doubt that matters:
% uname -a
Linux <redacted> 5.10.13-x86-linode161 #1 SMP Thu Feb 4 13:51:28 EST 2021 i686 GNU/Linux
And master HEAD:
% git rev-parse HEAD
d911b0c1121fbdee527d8786cb68f4959d785cff
According to the commit history, this bug appears to have been present since 2009.
I did not examine other types of children for exit code propagation issues.