Duplicate email notifications after 4.1.0 update #23727

rixx · 2023-02-19T13:04:17Z

Steps to reproduce the problem

Any action that triggers a notification, e.g. a new report
Receive multiple emails (up to 7 so far) for the new report

Expected behaviour

Only one email per notification issue

Actual behaviour

Multiple emails, varying amounts

Detailed description

The behaviour started with the 4.1.0 update.

Specifications

Mastodon 4.1.0

leahoswald · 2023-02-19T20:12:19Z

There is a still open report for the rc1 of 4.1.0 too: #23198

ClearlyClaire · 2023-02-20T14:31:47Z

This looks like #23198 indeed, though I have no idea what's happening in that issue. What's your mail setup? Do you have errors showing up in sidekiq?

leahoswald · 2023-02-20T15:22:42Z

We use an external mailserver and on this site nothing was changed. So we know it is a result of the update. This is the only mail related error I see:

Feb 20 16:02:03 worker01 bundle[1527793]: [ActiveJob] [ActionMailer::MailDeliveryJob] [be001321-1857-40ca-aec7-f232f1eb0c70] Error performing ActionMailer::MailDeliveryJob (Job ID: be001321-1857-40ca-aec7-f232f1eb0c70) from Sidekiq(mailers) in 6184.43ms: Net::ReadTimeout (Net::ReadTimeout with #Socket:(closed)):

But I see a connection on the mailserver and everything looks ok on this site. Maybe the timing of the disconnect from the mastodon worker is a bit off. Is there a timeout for the mail job? If so it could be a problem with the postscreen which delays the initial EHLO 2 to 6 seconds to block early speakers. Maybe the message is transferred already but mastodon thinks its a timeout and so it is retried multiple times.

ClearlyClaire · 2023-02-20T15:40:25Z

A little over 6 seconds is indeed long and I could see that triggering a timeout. I don't remember any change in that area, though, but I'll investigate. The last release that worked as expected was 4.0.2, right?

ClearlyClaire · 2023-02-20T15:56:57Z

I believe I found it!
We are not setting timeouts ourselves, but between Mastodon 4.0.2 and Mastodon 4.1.0, we updated dependencies, which pulled a new version of the mail gem had an undocumented change that went from having no timeout to having a default read timeout of 5 seconds (mikel/mail@4f56234)

leahoswald · 2023-02-20T16:01:00Z

The last release that worked as expected was 4.0.2, right?

Jup.

We are not setting timeouts ourselves, but between Mastodon 4.0.2 and Mastodon 4.1.0, we updated dependencies, which pulled a new version of the mail gem had an undocumented change that went from having no timeout to having a default read timeout of 5 seconds (mikel/mail@4f56234)

That sounds indeed like the source of the problem. Could be change it to something like 10 seconds to prevent this problem?

ClearlyClaire · 2023-02-20T16:08:39Z

Can you try #23750 and see if it fixes your issue?

leahoswald · 2023-02-20T17:16:14Z

Deployed. I will give you feedback later today (CET) or tomorrow. Thanks for the fast patch ❤️

leahoswald · 2023-02-21T08:56:22Z

Sorry doesn't solve the problem :( Same error in the log lines.

ClearlyClaire · 2023-02-21T08:58:33Z

Thanks for reporting. Have you restarted all sidekiq processes after applying the patch? What is the timeout in the logs now?

leahoswald · 2023-02-21T09:07:48Z

Now it's 10 seconds.

Feb 21 10:02:45 worker01 bundle[2908560]: [ActiveJob] [ActionMailer::MailDeliveryJob] [4e20ddbf-0194-475a-8377-2e458b545c42] Error performing ActionMailer::MailDeliveryJob (Job ID: 4e20ddbf-0194-475a-8377-2e458b545c42) from Sidekiq(mailers) in 10204.98ms: Net::ReadTimeout (Net::ReadTimeout with #Socket:(closed)):

I don't get why it thinks there is a timeout. On the Mailserver side everything looks good and there are multiple accounts on this mailserver and everything works fine.

Which versions of the mail gem does mastodon use? I can't find it in the gem file or anywhere else.

evgeni · 2023-02-21T09:14:37Z

Could it be a behaviour change from the Ruby 3.1 support in mail, that now pulls in net-smtp directly which might have a problem talking to your SMTP server?

ClearlyClaire · 2023-02-21T09:16:31Z

Which versions of the mail gem does mastodon use? I can't find it in the gem file or anywhere else.

Mastodon 4.1.0 uses mail 2.8.0.1. It's not a direct dependency but gets pulled by actionmailer. You can find the version in Gemfile.lock.

Now it's 10 seconds.

Feb 21 10:02:45 worker01 bundle[2908560]: [ActiveJob] [ActionMailer::MailDeliveryJob] [4e20ddbf-0194-475a-8377-2e458b545c42] Error performing ActionMailer::MailDeliveryJob (Job ID: 4e20ddbf-0194-475a-8377-2e458b545c42) from Sidekiq(mailers) in 10204.98ms: Net::ReadTimeout (Net::ReadTimeout with #Socket:(closed)):

Interesting. Try changing to a even higher timeout? Or set it to nil?

leahoswald · 2023-02-21T09:19:21Z

For the record we are still using: ruby 3.0.4

leahoswald · 2023-02-21T09:21:10Z

Mastodon 4.1.0 uses mail 2.8.0.1. It's not a direct dependency but gets pulled by actionmailer. You can find the version in Gemfile.lock.

The question is could I do a manual downgrade to the version in 4.0.2 to see if the problem is gone? Would help to pinpoint the source of the problem.

evgeni · 2023-02-21T09:25:18Z

Ruby 3.0.4 has net/smtp 0.2.1 bundled in (Mastodon 4.1.0 uses 0.3.3, while 4.0.2 used the bundled 0.2.1). 0.2.2 has an interesting changelog entry saying "Replace Timeout.timeout with socket timeout." So I guess downgrading mail and/or net/smtp is a good try.

leahoswald · 2023-02-21T09:29:53Z

FYI: Just restarted the workers with a 60s timeout and until now there is no error in the log.

Ruby 3.0.4 has net/smtp 0.2.1 bundled in. 0.2.2 has an interesting changelog entry saying "Replace Timeout.timeout with socket timeout." So I guess downgrading mail and/or net/smtp is a good try.

Again I need to know the versions used by 4.0.2. I'm not that experienced in ruby to know where to find this information.

ClearlyClaire · 2023-02-21T09:33:14Z

In Mastodon 4.0.2, there was no explicit dependency to net/smtp so it used the built-in one. But it also did not set any timeout, direct or indirect.

Mastodon 4.1.0 updated the mail gem, which introduced a default timeout of 5 seconds, and also added an explicit dependency on net-smtp 0.3.3.

The question is could I do a manual downgrade to the version in 4.0.2 to see if the problem is gone? Would help to pinpoint the source of the problem.

Yes, you could try: bundle add mail --version=2.7.1

In general we do not recommend messing with your dependencies though, so please remember you made this change and undo it when we have sorted out the issue.

2.7.1 already had support for timeouts but no default one, so you could try a shorter timeout and see if the issue still occurs.

leahoswald · 2023-02-21T11:45:36Z

I reduced the timeout from 60s down to 20s and it still looks like the problem is solved. So maybe 20s is a better default value than 10s. But I can check later if going down to 15s is an option too if yo want.

No downgrade or any other change done until now.

ClearlyClaire · 2023-02-21T13:26:13Z

Thanks for the report! It would be interesting to try with 15s. It's also surprising it takes more than 10s in the first place. It's also surprising that this would not prevent the mail from being sent, but only the connection to cleanly close.

As far as your own configuration is involved, I would try to avoid the delay on the SMTP server's end if you can, since that delay will affect the job duration anyway.

leahoswald · 2023-02-21T20:12:19Z

Ok, it looks like a 15s timeout works too. But if I go down to 10 seconds for a second test, I see the problem again within less than a minute. I could deactivate the postscreen check for the host that adds 2 to 6 seconds of delay or at all, but this is a setup not only I use, so it would be better to add a sensible timeout that is bigger than the default 5s to prevent others from running into the same problem. I would suggest 15s or for more of a safety margin 20s.

Still I don't know why the 4 seconds left, if we take the 10s timeout, are not enough there is no real reason on the mailserver site, which is indicated by the fact that the mail is send properly. As it is with 5s timeout most of the time. So there has to be a reason why the connection isn't closed in time leading the code to assume the mail wasn't send properly.

So I would say we've found a solution for the problem but not the cause yet. I will dig a little deeper with more debug logging in the next few days.

leahoswald · 2023-02-21T20:30:09Z

Sorry, I forgot that mastodon uses the submission port (587) not 25 and postscreen isn't active at all on this port because all users are authenticated. So there is no delay from that. I'm currently digging trough the logs but need more time but I thought is is a interesting finding for the moment.

leahoswald · 2023-02-21T22:30:01Z

Ok I did a loooot of debugging and I could pinpoint the problem to the part in the submission process where the mail content is actually transferred to the server. There was a delay of 10 seconds as if the client needed 10 seconds to transfer the mail. And I can't find any reason (in the docs and elsewhere on the internet) why there should be a delay without any other reason than a slow client. See the log:

Feb 21 21:34:18 mx1 postfix/submission/smtpd[1310127]: > worker.example.com[myip]: 354 End data with <CR><LF>.<CR><LF>
Feb 21 21:34:18 mx1 postfix/cleanup[1310131]: CF78425EE3A: message-id=<63f52aca6b911_338c82c3db4206ad@example.com>
[...other client stuff...]
Feb 21 21:34:29 mx1 postfix/submission/smtpd[1310127]: > worker.example.com[myip]: 250 2.0.0 Ok: queued as CF78425EE3A

I tried to find the source of this delay and was surprised finding that my mail client was also delayed but 5 seconds. And this one shouldn't be slow for any reason. So I checked another mail server with the same config and surprise, no delay at all. I restarted the postfix process multiple times in the debugging process, no change, still delayed. Then I saw that the other mail server had a slightly different virtual hardware config that should have no impact on the problem I was observing. But after the reboot everything was fast on the mail server, no delay at all. So I changed back the hardware difference for difference but being back at the old config everything was still working without any delay. I have to admit, that this is absolutely frustrating having the problem gone after a reboot and no idea what was the cause. So I changed back the timeout to the default 5 seconds and for the time writing this comment there was no new error and there was no delay in the logs.

This could be the end of the issue. A strange problem on the mail server made visible by the new introduced timeout in a dependency of mastodon. But with my experience I would like to wait a few more days without problems until closing this issue.

ClearlyClaire · 2023-02-22T10:06:55Z

Thank you for your investigations! So this seems to be a mail server issue, although I have no idea what would be causing it.

Considering the annoying failure mode and the fact that it affected at least one other person, I'd still consider #23750 worthy of inclusion (though the annoying failure mode is for the read_timeout so maybe the open_timeout can be lowered back to 10 or 5 seconds).

leahoswald · 2023-02-22T10:18:39Z

I just checked again and there was no new error in the logs over the last 12h so I think we can say it is solved.

And I agree with your suggestion that a read_timeout of 15 or 20 seconds is still a good idea and also that open_timeout could be lowered back to 5 seconds.

Fixes mastodon#23727

rixx added the bug Something isn't working label Feb 19, 2023

ClearlyClaire mentioned this issue Feb 20, 2023

Change unintended SMTP read timeout from 5 seconds to 20 seconds #23750

Merged

ClearlyClaire added a commit to ClearlyClaire/mastodon that referenced this issue Feb 22, 2023

Change unintended SMTP read timeout from 5 seconds to 20 seconds

b1ff9ea

Fixes mastodon#23727

Gargron closed this as completed in #23750 Mar 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate email notifications after 4.1.0 update #23727

Duplicate email notifications after 4.1.0 update #23727

rixx commented Feb 19, 2023

leahoswald commented Feb 19, 2023

ClearlyClaire commented Feb 20, 2023

leahoswald commented Feb 20, 2023

ClearlyClaire commented Feb 20, 2023

ClearlyClaire commented Feb 20, 2023

leahoswald commented Feb 20, 2023

ClearlyClaire commented Feb 20, 2023

leahoswald commented Feb 20, 2023 •

edited

leahoswald commented Feb 21, 2023 •

edited

ClearlyClaire commented Feb 21, 2023

leahoswald commented Feb 21, 2023 •

edited

evgeni commented Feb 21, 2023

ClearlyClaire commented Feb 21, 2023

leahoswald commented Feb 21, 2023

leahoswald commented Feb 21, 2023

evgeni commented Feb 21, 2023 •

edited

leahoswald commented Feb 21, 2023

ClearlyClaire commented Feb 21, 2023

leahoswald commented Feb 21, 2023 •

edited

ClearlyClaire commented Feb 21, 2023

leahoswald commented Feb 21, 2023

leahoswald commented Feb 21, 2023

leahoswald commented Feb 21, 2023

ClearlyClaire commented Feb 22, 2023

leahoswald commented Feb 22, 2023

Duplicate email notifications after 4.1.0 update #23727

Duplicate email notifications after 4.1.0 update #23727

Comments

rixx commented Feb 19, 2023

Steps to reproduce the problem

Expected behaviour

Actual behaviour

Detailed description

Specifications

leahoswald commented Feb 19, 2023

ClearlyClaire commented Feb 20, 2023

leahoswald commented Feb 20, 2023

ClearlyClaire commented Feb 20, 2023

ClearlyClaire commented Feb 20, 2023

leahoswald commented Feb 20, 2023

ClearlyClaire commented Feb 20, 2023

leahoswald commented Feb 20, 2023 • edited

leahoswald commented Feb 21, 2023 • edited

ClearlyClaire commented Feb 21, 2023

leahoswald commented Feb 21, 2023 • edited

evgeni commented Feb 21, 2023

ClearlyClaire commented Feb 21, 2023

leahoswald commented Feb 21, 2023

leahoswald commented Feb 21, 2023

evgeni commented Feb 21, 2023 • edited

leahoswald commented Feb 21, 2023

ClearlyClaire commented Feb 21, 2023

leahoswald commented Feb 21, 2023 • edited

ClearlyClaire commented Feb 21, 2023

leahoswald commented Feb 21, 2023

leahoswald commented Feb 21, 2023

leahoswald commented Feb 21, 2023

ClearlyClaire commented Feb 22, 2023

leahoswald commented Feb 22, 2023

leahoswald commented Feb 20, 2023 •

edited

leahoswald commented Feb 21, 2023 •

edited

leahoswald commented Feb 21, 2023 •

edited

evgeni commented Feb 21, 2023 •

edited

leahoswald commented Feb 21, 2023 •

edited