New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Duplicate email notifications after 4.1.0 update #23727
Comments
There is a still open report for the rc1 of 4.1.0 too: #23198 |
This looks like #23198 indeed, though I have no idea what's happening in that issue. What's your mail setup? Do you have errors showing up in sidekiq? |
We use an external mailserver and on this site nothing was changed. So we know it is a result of the update. This is the only mail related error I see: Feb 20 16:02:03 worker01 bundle[1527793]: [ActiveJob] [ActionMailer::MailDeliveryJob] [be001321-1857-40ca-aec7-f232f1eb0c70] Error performing ActionMailer::MailDeliveryJob (Job ID: be001321-1857-40ca-aec7-f232f1eb0c70) from Sidekiq(mailers) in 6184.43ms: Net::ReadTimeout (Net::ReadTimeout with #Socket:(closed)): But I see a connection on the mailserver and everything looks ok on this site. Maybe the timing of the disconnect from the mastodon worker is a bit off. Is there a timeout for the mail job? If so it could be a problem with the postscreen which delays the initial EHLO 2 to 6 seconds to block early speakers. Maybe the message is transferred already but mastodon thinks its a timeout and so it is retried multiple times. |
A little over 6 seconds is indeed long and I could see that triggering a timeout. I don't remember any change in that area, though, but I'll investigate. The last release that worked as expected was 4.0.2, right? |
I believe I found it! |
Jup.
That sounds indeed like the source of the problem. Could be change it to something like 10 seconds to prevent this problem? |
Can you try #23750 and see if it fixes your issue? |
Deployed. I will give you feedback later today (CET) or tomorrow. Thanks for the fast patch ❤️ |
Sorry doesn't solve the problem :( Same error in the log lines. |
Thanks for reporting. Have you restarted all sidekiq processes after applying the patch? What is the timeout in the logs now? |
Now it's 10 seconds. Feb 21 10:02:45 worker01 bundle[2908560]: [ActiveJob] [ActionMailer::MailDeliveryJob] [4e20ddbf-0194-475a-8377-2e458b545c42] Error performing ActionMailer::MailDeliveryJob (Job ID: 4e20ddbf-0194-475a-8377-2e458b545c42) from Sidekiq(mailers) in 10204.98ms: Net::ReadTimeout (Net::ReadTimeout with #Socket:(closed)): I don't get why it thinks there is a timeout. On the Mailserver side everything looks good and there are multiple accounts on this mailserver and everything works fine. Which versions of the mail gem does mastodon use? I can't find it in the gem file or anywhere else. |
Could it be a behaviour change from the Ruby 3.1 support in |
Mastodon 4.1.0 uses
Interesting. Try changing to a even higher timeout? Or set it to |
For the record we are still using: ruby 3.0.4 |
The question is could I do a manual downgrade to the version in 4.0.2 to see if the problem is gone? Would help to pinpoint the source of the problem. |
Ruby 3.0.4 has |
FYI: Just restarted the workers with a 60s timeout and until now there is no error in the log.
Again I need to know the versions used by 4.0.2. I'm not that experienced in ruby to know where to find this information. |
In Mastodon 4.0.2, there was no explicit dependency to Mastodon 4.1.0 updated the
Yes, you could try: In general we do not recommend messing with your dependencies though, so please remember you made this change and undo it when we have sorted out the issue. 2.7.1 already had support for timeouts but no default one, so you could try a shorter timeout and see if the issue still occurs. |
I reduced the timeout from 60s down to 20s and it still looks like the problem is solved. So maybe 20s is a better default value than 10s. But I can check later if going down to 15s is an option too if yo want. No downgrade or any other change done until now. |
Thanks for the report! It would be interesting to try with 15s. It's also surprising it takes more than 10s in the first place. It's also surprising that this would not prevent the mail from being sent, but only the connection to cleanly close. As far as your own configuration is involved, I would try to avoid the delay on the SMTP server's end if you can, since that delay will affect the job duration anyway. |
Ok, it looks like a 15s timeout works too. But if I go down to 10 seconds for a second test, I see the problem again within less than a minute. I could deactivate the postscreen check for the host that adds 2 to 6 seconds of delay or at all, but this is a setup not only I use, so it would be better to add a sensible timeout that is bigger than the default 5s to prevent others from running into the same problem. I would suggest 15s or for more of a safety margin 20s. Still I don't know why the 4 seconds left, if we take the 10s timeout, are not enough there is no real reason on the mailserver site, which is indicated by the fact that the mail is send properly. As it is with 5s timeout most of the time. So there has to be a reason why the connection isn't closed in time leading the code to assume the mail wasn't send properly. So I would say we've found a solution for the problem but not the cause yet. I will dig a little deeper with more debug logging in the next few days. |
Sorry, I forgot that mastodon uses the submission port (587) not 25 and postscreen isn't active at all on this port because all users are authenticated. So there is no delay from that. I'm currently digging trough the logs but need more time but I thought is is a interesting finding for the moment. |
Ok I did a loooot of debugging and I could pinpoint the problem to the part in the submission process where the mail content is actually transferred to the server. There was a delay of 10 seconds as if the client needed 10 seconds to transfer the mail. And I can't find any reason (in the docs and elsewhere on the internet) why there should be a delay without any other reason than a slow client. See the log:
I tried to find the source of this delay and was surprised finding that my mail client was also delayed but 5 seconds. And this one shouldn't be slow for any reason. So I checked another mail server with the same config and surprise, no delay at all. I restarted the postfix process multiple times in the debugging process, no change, still delayed. Then I saw that the other mail server had a slightly different virtual hardware config that should have no impact on the problem I was observing. But after the reboot everything was fast on the mail server, no delay at all. So I changed back the hardware difference for difference but being back at the old config everything was still working without any delay. I have to admit, that this is absolutely frustrating having the problem gone after a reboot and no idea what was the cause. So I changed back the timeout to the default 5 seconds and for the time writing this comment there was no new error and there was no delay in the logs. This could be the end of the issue. A strange problem on the mail server made visible by the new introduced timeout in a dependency of mastodon. But with my experience I would like to wait a few more days without problems until closing this issue. |
Thank you for your investigations! So this seems to be a mail server issue, although I have no idea what would be causing it. Considering the annoying failure mode and the fact that it affected at least one other person, I'd still consider #23750 worthy of inclusion (though the annoying failure mode is for the |
I just checked again and there was no new error in the logs over the last 12h so I think we can say it is solved. And I agree with your suggestion that a read_timeout of 15 or 20 seconds is still a good idea and also that open_timeout could be lowered back to 5 seconds. |
Steps to reproduce the problem
Expected behaviour
Only one email per notification issue
Actual behaviour
Multiple emails, varying amounts
Detailed description
The behaviour started with the 4.1.0 update.
Specifications
Mastodon 4.1.0
The text was updated successfully, but these errors were encountered: