Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate email notifications after 4.1.0 update #23727

Closed
rixx opened this issue Feb 19, 2023 · 25 comments · Fixed by #23750
Closed

Duplicate email notifications after 4.1.0 update #23727

rixx opened this issue Feb 19, 2023 · 25 comments · Fixed by #23750
Labels
bug Something isn't working

Comments

@rixx
Copy link

rixx commented Feb 19, 2023

Steps to reproduce the problem

  1. Any action that triggers a notification, e.g. a new report
  2. Receive multiple emails (up to 7 so far) for the new report
    2023-02-19T14:02:58+01:00

Expected behaviour

Only one email per notification issue

Actual behaviour

Multiple emails, varying amounts

Detailed description

The behaviour started with the 4.1.0 update.

Specifications

Mastodon 4.1.0

@rixx rixx added the bug Something isn't working label Feb 19, 2023
@leahoswald
Copy link

There is a still open report for the rc1 of 4.1.0 too: #23198

@ClearlyClaire
Copy link
Contributor

This looks like #23198 indeed, though I have no idea what's happening in that issue. What's your mail setup? Do you have errors showing up in sidekiq?

@leahoswald
Copy link

We use an external mailserver and on this site nothing was changed. So we know it is a result of the update. This is the only mail related error I see:

Feb 20 16:02:03 worker01 bundle[1527793]: [ActiveJob] [ActionMailer::MailDeliveryJob] [be001321-1857-40ca-aec7-f232f1eb0c70] Error performing ActionMailer::MailDeliveryJob (Job ID: be001321-1857-40ca-aec7-f232f1eb0c70) from Sidekiq(mailers) in 6184.43ms: Net::ReadTimeout (Net::ReadTimeout with #Socket:(closed)):

But I see a connection on the mailserver and everything looks ok on this site. Maybe the timing of the disconnect from the mastodon worker is a bit off. Is there a timeout for the mail job? If so it could be a problem with the postscreen which delays the initial EHLO 2 to 6 seconds to block early speakers. Maybe the message is transferred already but mastodon thinks its a timeout and so it is retried multiple times.

@ClearlyClaire
Copy link
Contributor

A little over 6 seconds is indeed long and I could see that triggering a timeout. I don't remember any change in that area, though, but I'll investigate. The last release that worked as expected was 4.0.2, right?

@ClearlyClaire
Copy link
Contributor

I believe I found it!
We are not setting timeouts ourselves, but between Mastodon 4.0.2 and Mastodon 4.1.0, we updated dependencies, which pulled a new version of the mail gem had an undocumented change that went from having no timeout to having a default read timeout of 5 seconds (mikel/mail@4f56234)

@leahoswald
Copy link

The last release that worked as expected was 4.0.2, right?

Jup.

We are not setting timeouts ourselves, but between Mastodon 4.0.2 and Mastodon 4.1.0, we updated dependencies, which pulled a new version of the mail gem had an undocumented change that went from having no timeout to having a default read timeout of 5 seconds (mikel/mail@4f56234)

That sounds indeed like the source of the problem. Could be change it to something like 10 seconds to prevent this problem?

@ClearlyClaire
Copy link
Contributor

Can you try #23750 and see if it fixes your issue?

@leahoswald
Copy link

leahoswald commented Feb 20, 2023

Deployed. I will give you feedback later today (CET) or tomorrow. Thanks for the fast patch ❤️

@leahoswald
Copy link

leahoswald commented Feb 21, 2023

Sorry doesn't solve the problem :( Same error in the log lines.

@ClearlyClaire
Copy link
Contributor

Thanks for reporting. Have you restarted all sidekiq processes after applying the patch? What is the timeout in the logs now?

@leahoswald
Copy link

leahoswald commented Feb 21, 2023

Now it's 10 seconds.

Feb 21 10:02:45 worker01 bundle[2908560]: [ActiveJob] [ActionMailer::MailDeliveryJob] [4e20ddbf-0194-475a-8377-2e458b545c42] Error performing ActionMailer::MailDeliveryJob (Job ID: 4e20ddbf-0194-475a-8377-2e458b545c42) from Sidekiq(mailers) in 10204.98ms: Net::ReadTimeout (Net::ReadTimeout with #Socket:(closed)):

I don't get why it thinks there is a timeout. On the Mailserver side everything looks good and there are multiple accounts on this mailserver and everything works fine.

Which versions of the mail gem does mastodon use? I can't find it in the gem file or anywhere else.

@evgeni
Copy link

evgeni commented Feb 21, 2023

Could it be a behaviour change from the Ruby 3.1 support in mail, that now pulls in net-smtp directly which might have a problem talking to your SMTP server?

@ClearlyClaire
Copy link
Contributor

Which versions of the mail gem does mastodon use? I can't find it in the gem file or anywhere else.

Mastodon 4.1.0 uses mail 2.8.0.1. It's not a direct dependency but gets pulled by actionmailer. You can find the version in Gemfile.lock.

Now it's 10 seconds.

Feb 21 10:02:45 worker01 bundle[2908560]: [ActiveJob] [ActionMailer::MailDeliveryJob] [4e20ddbf-0194-475a-8377-2e458b545c42] Error performing ActionMailer::MailDeliveryJob (Job ID: 4e20ddbf-0194-475a-8377-2e458b545c42) from Sidekiq(mailers) in 10204.98ms: Net::ReadTimeout (Net::ReadTimeout with #Socket:(closed)):

Interesting. Try changing to a even higher timeout? Or set it to nil?

@leahoswald
Copy link

For the record we are still using: ruby 3.0.4

@leahoswald
Copy link

Mastodon 4.1.0 uses mail 2.8.0.1. It's not a direct dependency but gets pulled by actionmailer. You can find the version in Gemfile.lock.

The question is could I do a manual downgrade to the version in 4.0.2 to see if the problem is gone? Would help to pinpoint the source of the problem.

@evgeni
Copy link

evgeni commented Feb 21, 2023

Ruby 3.0.4 has net/smtp 0.2.1 bundled in (Mastodon 4.1.0 uses 0.3.3, while 4.0.2 used the bundled 0.2.1). 0.2.2 has an interesting changelog entry saying "Replace Timeout.timeout with socket timeout." So I guess downgrading mail and/or net/smtp is a good try.

@leahoswald
Copy link

FYI: Just restarted the workers with a 60s timeout and until now there is no error in the log.

Ruby 3.0.4 has net/smtp 0.2.1 bundled in. 0.2.2 has an interesting changelog entry saying "Replace Timeout.timeout with socket timeout." So I guess downgrading mail and/or net/smtp is a good try.

Again I need to know the versions used by 4.0.2. I'm not that experienced in ruby to know where to find this information.

@ClearlyClaire
Copy link
Contributor

In Mastodon 4.0.2, there was no explicit dependency to net/smtp so it used the built-in one. But it also did not set any timeout, direct or indirect.

Mastodon 4.1.0 updated the mail gem, which introduced a default timeout of 5 seconds, and also added an explicit dependency on net-smtp 0.3.3.

The question is could I do a manual downgrade to the version in 4.0.2 to see if the problem is gone? Would help to pinpoint the source of the problem.

Yes, you could try: bundle add mail --version=2.7.1

In general we do not recommend messing with your dependencies though, so please remember you made this change and undo it when we have sorted out the issue.

2.7.1 already had support for timeouts but no default one, so you could try a shorter timeout and see if the issue still occurs.

@leahoswald
Copy link

leahoswald commented Feb 21, 2023

I reduced the timeout from 60s down to 20s and it still looks like the problem is solved. So maybe 20s is a better default value than 10s. But I can check later if going down to 15s is an option too if yo want.

No downgrade or any other change done until now.

@ClearlyClaire
Copy link
Contributor

Thanks for the report! It would be interesting to try with 15s. It's also surprising it takes more than 10s in the first place. It's also surprising that this would not prevent the mail from being sent, but only the connection to cleanly close.

As far as your own configuration is involved, I would try to avoid the delay on the SMTP server's end if you can, since that delay will affect the job duration anyway.

@leahoswald
Copy link

Ok, it looks like a 15s timeout works too. But if I go down to 10 seconds for a second test, I see the problem again within less than a minute. I could deactivate the postscreen check for the host that adds 2 to 6 seconds of delay or at all, but this is a setup not only I use, so it would be better to add a sensible timeout that is bigger than the default 5s to prevent others from running into the same problem. I would suggest 15s or for more of a safety margin 20s.

Still I don't know why the 4 seconds left, if we take the 10s timeout, are not enough there is no real reason on the mailserver site, which is indicated by the fact that the mail is send properly. As it is with 5s timeout most of the time. So there has to be a reason why the connection isn't closed in time leading the code to assume the mail wasn't send properly.

So I would say we've found a solution for the problem but not the cause yet. I will dig a little deeper with more debug logging in the next few days.

@leahoswald
Copy link

Sorry, I forgot that mastodon uses the submission port (587) not 25 and postscreen isn't active at all on this port because all users are authenticated. So there is no delay from that. I'm currently digging trough the logs but need more time but I thought is is a interesting finding for the moment.

@leahoswald
Copy link

Ok I did a loooot of debugging and I could pinpoint the problem to the part in the submission process where the mail content is actually transferred to the server. There was a delay of 10 seconds as if the client needed 10 seconds to transfer the mail. And I can't find any reason (in the docs and elsewhere on the internet) why there should be a delay without any other reason than a slow client. See the log:

Feb 21 21:34:18 mx1 postfix/submission/smtpd[1310127]: > worker.example.com[myip]: 354 End data with <CR><LF>.<CR><LF>
Feb 21 21:34:18 mx1 postfix/cleanup[1310131]: CF78425EE3A: message-id=<63f52aca6b911_338c82c3db4206ad@example.com>
[...other client stuff...]
Feb 21 21:34:29 mx1 postfix/submission/smtpd[1310127]: > worker.example.com[myip]: 250 2.0.0 Ok: queued as CF78425EE3A

I tried to find the source of this delay and was surprised finding that my mail client was also delayed but 5 seconds. And this one shouldn't be slow for any reason. So I checked another mail server with the same config and surprise, no delay at all. I restarted the postfix process multiple times in the debugging process, no change, still delayed. Then I saw that the other mail server had a slightly different virtual hardware config that should have no impact on the problem I was observing. But after the reboot everything was fast on the mail server, no delay at all. So I changed back the hardware difference for difference but being back at the old config everything was still working without any delay. I have to admit, that this is absolutely frustrating having the problem gone after a reboot and no idea what was the cause. So I changed back the timeout to the default 5 seconds and for the time writing this comment there was no new error and there was no delay in the logs.

This could be the end of the issue. A strange problem on the mail server made visible by the new introduced timeout in a dependency of mastodon. But with my experience I would like to wait a few more days without problems until closing this issue.

@ClearlyClaire
Copy link
Contributor

Thank you for your investigations! So this seems to be a mail server issue, although I have no idea what would be causing it.

Considering the annoying failure mode and the fact that it affected at least one other person, I'd still consider #23750 worthy of inclusion (though the annoying failure mode is for the read_timeout so maybe the open_timeout can be lowered back to 10 or 5 seconds).

@leahoswald
Copy link

I just checked again and there was no new error in the logs over the last 12h so I think we can say it is solved.

And I agree with your suggestion that a read_timeout of 15 or 20 seconds is still a good idea and also that open_timeout could be lowered back to 5 seconds.

ClearlyClaire added a commit to ClearlyClaire/mastodon that referenced this issue Feb 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants