

Queue stalls when non-SMTP email backend raises error #73

Closed
medmunds opened this issue Nov 8, 2016 · 27 comments
@medmunds
Contributor

medmunds commented Nov 8, 2016

I'm using django-mailer with a django-anymail backend (specifically, MAILER_EMAIL_BACKEND = "anymail.backends.postmark.PostmarkBackend").

Anymail is not based on smtplib, so it raises its own exceptions for rejected recipients or other errors. Because django-mailer is looking for specific SMTP exceptions, an Anymail exception ends up stalling the queue, rather than deferring the unsendable message. (The exception aborts the send_mail management command, leaving the unsendable message at the top of the prioritization list for the next send.)

What would be a reasonable way to solve this? (I maintain django-anymail, btw.)

  • I don't think it makes sense to add Anymail-specific exceptions to django-mailer. (Because that starts down the slippery slope to adding specific code for all other email backends.)
  • django-mailer could define and catch its own SendMailException (e.g.), in addition to the current SMTP exceptions. Then other packages that want to play nice with django-mailer could raise mailer.SendMailException when django-mailer is installed. (I'm not at all opposed to adding this to Anymail. But it would sort of start the opposite slippery slope, where email backends need to add specific code for any likely queueing wrappers.)
  • django-mailer's send_all could just catch Exception, rather than specific SMTP exceptions. I realize broad catch statements are generally a bad idea, but this seems like a case where it might be appropriate.

Thoughts? I'm happy to submit a PR if one of the django-mailer changes makes sense.

@spookylukey
Contributor

Yes, this is a tricky issue.

Another possibility is that we get Django to define and use a SendMailException class. This seems like a reasonable proposal - given Django's support for different mail backends, it seems like there should be a good way of catching mail sending exceptions for any backend. The support Django has so far is "fail silently" mode, which is not always enough.

In fact, the current docs for fail_silently - https://docs.djangoproject.com/en/1.10/topics/email/#send-mail - are incorrect if you are not using the default backend.

As a Django core dev, I would back this proposal. I'm unlikely to have the time to implement it. It could get tricky as well, because it would mean wrapping exceptions from the stdlib with Django's own exceptions - using something like this technique I guess http://stackoverflow.com/questions/3847503/wrapping-exceptions-in-python - but you would also need to dynamically subclass all exceptions, so that existing code catching smtplib exceptions doesn't break.
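A minimal sketch of that dynamic-subclassing technique — note that `SendMailException` and `wrap_exception` are hypothetical names (Django defines neither today); this only illustrates how existing `except smtplib...` code could keep working:

```python
import smtplib

class SendMailException(Exception):
    """Would-be Django-level base class for mail-sending errors (hypothetical)."""

_wrapper_cache = {}

def wrap_exception(exc):
    """Rebuild `exc` as an instance of a class that subclasses BOTH
    SendMailException and exc's own class, so existing code catching
    smtplib exceptions keeps working unchanged."""
    cls = type(exc)
    wrapper = _wrapper_cache.get(cls)
    if wrapper is None:
        # Dynamically create a subclass of both hierarchies.
        wrapper = type(cls.__name__, (SendMailException, cls), {})
        _wrapper_cache[cls] = wrapper
    return wrapper(*exc.args)

err = wrap_exception(smtplib.SMTPConnectError(421, "service not available"))
assert isinstance(err, SendMailException)      # new generic catch works
assert isinstance(err, smtplib.SMTPException)  # old specific catch still works
```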

I would be happy to add some interim solution to django-mailer as well (which would be the only solution if the Django change didn't work out). Probably the last of your suggestions is best, I can see the argument for it.

If the Django changes go through, we could do

try:
    from django.core.mail import SendMailException
except ImportError:
    SendMailException = Exception

or something.

@medmunds
Contributor Author

Hmm...

I suppose another option would be to directly use the smtplib exception classes as the generic EmailBackend exceptions: in the discussion above, SendMailException = smtplib.SMTPException. (And then the existing Django docs would still be correct. :smile:)

But the more I dig into this, it seems like there are at least two distinct types of sending errors:

  • Transient errors, like network connectivity issues. Deferring and then retrying the exact same EmailMessage later is likely to succeed.
  • Permanent errors, like invalid email addresses. If your backend rejected "To: nobody@gmail" on the first send attempt, no amount of retrying is likely to make that same EmailMessage go through.

Deferring permanent errors isn't helpful, and can lead to other problems. (E.g., if you use EMAIL_MAX_BATCH or EMAIL_MAX_DEFERRED, you'd eventually get into a state where the number of deferred permanent errors would block retrying later-deferred transient errors.)

So, I might expect send_all to end up something like:

    # (*many* details omitted)
    for message in prioritize():
        try:
            connection.send(message.email)
        except TransientSendError as err:
            message.defer()
            MessageLog.objects.log(message, RESULT_FAILURE, log_message=str(err))
        except Exception as err:  # anything else is "permanent error"
            MessageLog.objects.log(message, RESULT_PERMANENT_FAILURE, log_message=str(err))
            message.delete()  # remove from queue -- *don't* try again later
        else:  # successful send
            MessageLog.objects.log(message, RESULT_SUCCESS)
            message.delete()     

(Though there's actually a third possible behavior in the existing django-mailer code: if the backend throws an unexpected exception, send_all immediately aborts the batch (crashes the send_all management command), without deferring the current message. If you're using the recommended cron scripts, it'll retry that send and continue the batch one minute later. And this turns out to be the ideal way to handle transient network issues with a transactional ESP backend like Anymail.)

I think I'm leaning toward a combination of changing Anymail's exceptions to inherit from smtplib errors where meaningful, and then proposing a PR to django-mailer to split the individual smtplib exceptions between transient and permanent error handling as above.
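A hypothetical sketch of the smtplib-inheritance half of that plan — the class names below are illustrative, not Anymail's actual exception hierarchy:

```python
import smtplib

class ESPError(Exception):
    """Backend's own base class for ESP API errors (illustrative name)."""

class ESPRecipientsRefused(ESPError, smtplib.SMTPRecipientsRefused):
    """Permanent: the ESP rejected a recipient. Inheriting from
    SMTPRecipientsRefused lets django-mailer's existing smtplib
    handling see (and defer/log) it without any changes."""

class ESPConnectionError(ESPError, smtplib.SMTPConnectError):
    """Transient: API/network outage; retrying the same message is safe."""

err = ESPRecipientsRefused({"nobody@gmail": (550, "invalid mailbox")})
assert isinstance(err, ESPError)               # backend's own catch works
assert isinstance(err, smtplib.SMTPException)  # queueing wrapper's catch works
```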

@spookylukey
Contributor

OK, it looks like you've thought this through, please go ahead and I'll look at your PR. I won't have time to work on it myself.

medmunds added a commit to medmunds/django-mailer that referenced this issue Dec 6, 2016
If sending a message raises a "non-retryable" error,
delete the Message from the queue, and note the error
in the MessageLog with a new `RESULT_ERROR` status.
(Retryable errors continue to be deferred and logged with
`RESULT_FAILURE`.)

This addresses two, related problems:
1. Deferring a message with an expected, but non-retryable
   error would over time grow the deferred queue (repeatedly
   hammering the backend with the same unsendable messages).
2. Sending a message with an unhandled error would
   interrupt the send_all, immediately and permanently
   stalling the entire queue.

"Non-retryable" errors are:
* Anything where the original message would need to be altered
  to be sent successfully. (E.g., a sender not recognized
  by your SMTP server, or a malformed recipient address.)
  Covers the first problem.
* Catchall `Exception` for anything not explicitly enumerated
  as a transient, retryable error (like a network issue). The catchall
  covers most errors raised by third-party EmailBackends.
  Addresses the second problem.

Addresses pinax#73.
@medmunds
Contributor Author

medmunds commented Dec 7, 2016

OK, I've been struggling with how to fix the original problem, without also breaking some current, (probably-unintended but) valuable behavior. Apologies for the length; specific proposal and questions at the end.

Here's how various sending exceptions are handled now (assuming recommended cron jobs and default settings):

  1. smtplib.SMTPConnectError: the exception is not handled, not recorded in the MessageLog, and send_mail (engine.send_all) crashes -- leaving the current message and all following ones still queued. A minute later, the next send_mail tries to send that same message again.

    For a transient network/server glitch, this is arguably a good outcome. There's nothing wrong with the message itself, so you definitely want to retry sending it later. And a minute is often enough to resolve network problems -- so your mail goes through promptly.

  2. UnicodeDecodeError or any other unexpected exception: same handling as the previous case, but bad outcome: the exception is not handled, not recorded in the MessageLog, and send_mail crashes leaving the current message and all following ones queued. A minute later, the process repeats, and the send fails again on the same message. The send queue is stalled.

    If the problem is with the message content itself, the queue stays stalled, and none of your email will go out. The problematic message needs to be deleted from the queue.

  3. smtplib.SMTPRecipientsRefused: the issue is recorded in the MessageLog, the message is deferred, and send_all continues sending other messages in the queue. Up to 20 minutes later, retry_deferred will re-queue the deferred message for the next send_all, and the process repeats.

    In Python 2, this behavior applies to socket.error and a specific list (though not all) of smtplib.SMTPException errors. Other smtplib errors fall under (1) or (2) above. But in Python 3.4+, this changes: SMTPException becomes a subclass of socket.error, so all smtplib errors get deferred instead.

    Deferral actually catches a mixture of transient problems (socket.error) that are likely to succeed on later retry, as well as message-content-specific issues that won't (retrying won't make "nobody@gmail" a valid recipient). Over time, the deferred queue will fill with the latter, and every 20 minutes you'll hammer your server trying to resend them all. But things will still work, albeit inefficiently. (Well, unless you've set MAILER_EMAIL_MAX_DEFERRED. But that's beyond the scope of this issue.)
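The Python 3 half of that behavior change can be checked directly: since Python 3.3, socket.error is just an alias for OSError, and since 3.4, smtplib.SMTPException subclasses it.

```python
import smtplib
import socket

# Python 3.3+ merged socket.error into OSError; Python 3.4 made
# SMTPException an OSError subclass, so a handler catching
# socket.error now also catches every smtplib error.
assert socket.error is OSError
assert issubclass(smtplib.SMTPException, OSError)
assert issubclass(smtplib.SMTPRecipientsRefused, socket.error)
```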

Proposed changes

  • A. Explicitly support case (1): catch and log smtplib.SMTPConnectError and SMTPServerDisconnected, and cleanly abort the send_all. (This also normalizes Python 2/3 handling for these exceptions.)

    Question: should socket.error (a.k.a. IOError/OSError) be included in this case?

  • B. Fix case (2): catch all unhandled exceptions, log the error, and delete the message from the queue to prevent the stall.

    Question: Should we also add a MAILER_EMAIL_MAX_ERRORS setting for this case?

  • C. Normalize the Python 2/3 smtplib exception behavior from case (3): catch and defer the base smtplib.SMTPException.

There would be no other changes to send_all exception handling. (In particular, I'm not going to try to distinguish retryable from non-retryable smtplib exceptions, which is non-trivial. They'll all just get deferred.)

By default, third-party EmailBackend exceptions will fall under (B) and delete the unsent message. Backends could opt particular errors into deferred-retry (C) by subclassing smtplib.SMTPException (seems reasonable). Or they could opt into immediate-retry (A) by subclassing smtplib.SMTPConnectError (which seems odd; IOError or OSError seem like better choices here; see question above).
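Taken together, the A/B/C routing above could be sketched like this — the action names and helper are illustrative, drawn from the proposal rather than merged code:

```python
import smtplib

ABORT_ERRORS = (smtplib.SMTPConnectError, smtplib.SMTPServerDisconnected)

def classify(exc):
    """'abort' (A): log, then stop send_all cleanly; cron retries the batch soon.
    'defer' (C): defer just this message for retry_deferred to re-queue.
    'error' (B): log the error and delete the message from the queue."""
    if isinstance(exc, ABORT_ERRORS):       # check first: these are also
        return "abort"                      # SMTPException subclasses
    if isinstance(exc, smtplib.SMTPException):
        return "defer"
    return "error"

assert classify(smtplib.SMTPServerDisconnected("lost connection")) == "abort"
assert classify(smtplib.SMTPRecipientsRefused({})) == "defer"
assert classify(ValueError("bad header")) == "error"  # third-party backends land here
```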

medmunds added a commit to medmunds/django-mailer that referenced this issue Dec 10, 2016
[NOT FOR MERGE yet: see "TODO" for discussion]

In engine.send_all:

* Handle transient connection issues by logging error
  and cleanly exiting send_all. (Previously, crashed
  without logging.)
* Handle all other smtplib.SMTPException by deferring
  message. (Previously, behavior differed between
  Python 2 and 3.)
* Handle all other exceptions by logging error and
  deleting message from send queue. (Previously,
  crashed without logging, leaving possibly-unsendable
  message blocking send queue.)

**Potentially-breaking change:** If you are using a
MAILER_EMAIL_BACKEND other than the default
SMTP EmailBackend, errors from that backend *may*
be handled differently now.

Addresses pinax#73. (And additional discussion there.)
medmunds added a commit to medmunds/django-mailer that referenced this issue Dec 10, 2016
[NOT FOR MERGE yet: see "TODO" comments for discussion]

In engine.send_all:

* Handle transient connection issues by logging failure
  and cleanly exiting send_all. (Previously, crashed
  without logging.)
* Handle all other smtplib.SMTPException by deferring
  message. (Previously, behavior differed between
  Python 2 and 3.)
* Handle all other exceptions by logging error with new
  RESULT_ERROR status and deleting message from send
  queue. (Previously, crashed without logging, leaving
  possibly-unsendable message blocking send queue.)

**Potentially-breaking change:** If you are using a
MAILER_EMAIL_BACKEND other than the default
SMTP EmailBackend, errors from that backend *may*
be handled differently now.

Addresses pinax#73. (And additional discussion there.)
@gnunicorn

I'm facing the same issue using SparkPost here: it raises its own exceptions, which aren't caught, so the whole queue stalls and I have to manually delete the messages from the queue before it can continue.

Traceback (most recent call last):
  File "manage.py", line 22, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 359, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.6/site-packages/django/core/management/base.py", line 294, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.6/site-packages/django/core/management/base.py", line 345, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python3.6/site-packages/mailer/management/commands/send_mail.py", line 26, in handle
    send_all()
  File "/usr/local/lib/python3.6/site-packages/mailer/engine.py", line 144, in send_all
    email.send()
  File "/usr/local/lib/python3.6/site-packages/django/core/mail/message.py", line 342, in send
    return self.get_connection(fail_silently).send_messages([self])
  File "/usr/local/lib/python3.6/site-packages/sparkpost/django/email_backend.py", line 29, in send_messages
    response = self._send(SparkPostMessage(message))
  File "/usr/local/lib/python3.6/site-packages/sparkpost/django/email_backend.py", line 39, in _send
    return self.client.transmissions.send(**params)
  File "/usr/local/lib/python3.6/site-packages/sparkpost/transmissions.py", line 254, in send
    results = self.request('POST', self.uri, data=json.dumps(payload))
  File "/usr/local/lib/python3.6/site-packages/sparkpost/base.py", line 41, in request
    **kwargs)
  File "/usr/local/lib/python3.6/site-packages/sparkpost/base.py", line 16, in request
    raise SparkPostAPIException(response)
sparkpost.exceptions.SparkPostAPIException: Call to https://api.sparkpost.com/api/v1/transmissions returned 400, errors:

        Message generation rejected Code: 1902 Description: recipient address was suppressed due to customer policy 

I personally favor option B, maybe with the option to email the debug to the system administrators in that case?

@spookylukey
Contributor

@medmunds Sorry for the really long delay on this one.

These proposed changes sound good, assuming that as well as deleting from the queue, we create a MessageLog with the original data, so it is not lost forever. As I think you suggested we would use a new RESULT_CODE for this case.

gnunicorn added a commit to DemokratieInBewegung/django-mailer that referenced this issue Jun 30, 2017
@medmunds
Contributor Author

@spookylukey Sorry, I just realized this was still open and I'd been missing notifications on it.

Unfortunately, I haven't been actively using django-mailer for some time now, and given that combined with the ambiguity of the various email exceptions, I'm not really confident I can move this forward without potentially injecting other problems.

If someone else is able to adopt PR #78, that'd be fine, but I'd also completely understand if it makes more sense to just close it unmerged.

@taleinat

Instead of trying to distinguish between transient and permanent errors, and possibly deleting emails (which is a deal breaker for my use cases), I propose a different approach: count how many times a message has been deferred (add a field to the model), and move messages deferred more than X times to a new state, in which they will no longer be retried but will be kept in the DB for introspection. This requires a DB migration, but could simplify the code and be more robust. I would be willing to work on getting this implemented if there is interest!
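A framework-free sketch of that idea — the field, state, and setting names below are made up for illustration; the real change would be a model field plus a migration:

```python
MAX_DEFERRALS = 5  # hypothetical MAILER_MAX_DEFERRALS setting

DEFERRED = "deferred"
DONT_SEND = "dont_send"  # new terminal state, kept in the DB for introspection

class QueuedMessage:
    def __init__(self):
        self.deferral_count = 0  # would be a new model field (needs a migration)
        self.state = "queued"

    def defer(self):
        self.deferral_count += 1
        # After too many deferrals, park the message instead of retrying forever.
        if self.deferral_count > MAX_DEFERRALS:
            self.state = DONT_SEND
        else:
            self.state = DEFERRED

msg = QueuedMessage()
for _ in range(MAX_DEFERRALS + 1):
    msg.defer()
assert msg.state == DONT_SEND  # no longer retried, still inspectable
```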

@volksman
Contributor

@taleinat I'm starting to feel that the only lib django-mailer should support out of the box is smtplib as it's the most logical. Anything else would require the developer to decide how to (better) handle their errors and more specifically errors from their chosen mailer backend.

I believe that the PR I just submitted achieves that. Feedback very welcome. I've never submitted a PR before, let alone code that handles exceptions like this, so I'm not even sure whether this is acceptable best practice.

@volksman
Contributor

@taleinat I think adding a retry count would still be beneficial. Adds that much more flexibility for the dev to decide how best to deal with error messages. Thinking that should be brought in through a separate PR.

@taleinat

taleinat commented Oct 26, 2017

@volksman I'd be happy to help with that. However, if there isn't going to be specific built-in support for Amazon SES nor a plug-in mechanism enabling addition of such support, I won't be able to justify working on that at this point, since I won't actually be able to use this library on my current project.

@volksman
Contributor

@taleinat take a look at PR #92 and let me know if that works for you. You can define your own error handler and it seems to be working in my production env... :)

@taleinat

@volksman PR #92 would work, though to me it seems overly generic. I think in most cases the only change needed would be to the list of exceptions to catch. I suggest adding a helper, an error-handler factory which does the default logic with a different set of exceptions to catch.
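A sketch of that suggested factory — the names are illustrative, not PR #92's actual API: it builds the default defer-vs-drop handler, parameterized only by which exception types count as retryable.

```python
import smtplib

def make_error_handler(retryable=(smtplib.SMTPException,)):
    """Error-handler factory: default logic, configurable exception list."""
    def handle(message, exc):
        """Return the action the engine should take for `message`."""
        if isinstance(exc, retryable):
            return "defer"   # transient: retry later
        return "delete"      # permanent/unknown: log and drop
    return handle

# Most backends would only need to swap the exception list, e.g.:
# handler = make_error_handler(retryable=(smtplib.SMTPException, SparkPostAPIException))
default_handler = make_error_handler()
assert default_handler(None, smtplib.SMTPDataError(451, "try again")) == "defer"
assert default_handler(None, RuntimeError("api rejected message")) == "delete"
```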

@hartwork
Contributor

hartwork commented Jun 22, 2019

Hi!

There are a bunch of current and past pull requests on this topic; my impression is this:

For ways to go forward I see:

For (b) and (c) I would volunteer, provided that someone has some time for review and permission to merge and wants to see this finished, too.

What do you think?

CC: @pakal @spookylukey

@pakal
Contributor

pakal commented Jun 22, 2019

I can't merge and I've finally integrated a different mailer approach, but feel free to ping me for some formal code reviews :)

@hartwork
Contributor

Okay thanks! @spookylukey what do you think?

@spookylukey
Contributor

@hartwork . First very sorry for being unresponsive!

I think the lack of progress on the more complex approaches means we should go for something simpler like your a) or c), probably c) if possible. This will be better than trying for a perfect solution. I will try to review a patch like that in a timely manner, and I can merge it.

@volksman volksman mentioned this issue Oct 14, 2020
@volksman
Contributor

@hartwork @spookylukey I had to upgrade my project to django3 so I'm back with my suggestions in a PR again. Let me know...

@hartwork
Contributor

@volksman there are three open pull requests by you targeting #73 now. Pull requests can be re-used, updated, and rebased, so ideally there would be one pull request that goes through a loop of review and fixing until some consensus and a solid implementation is reached. A quick look at #127 shows that the method handle_backend_exception has not been tested and has at least two bugs. It's not marked as a draft or work-in-progress, and the CI is red as well. I'm not very happy about it.

@volksman
Contributor

@hartwork It has been at least 2 years since my first PR. Rebasing seemed like a waste given the project has undergone an entire structural change. I've closed off the old PR in favour of this one and the original bandaid that I still use in production.

Sorry I don't know all the ins and outs of Github, CI etc...I work alone for the most part so collaboration is not my strong suit, however if it's really that bothersome I can stop trying to contribute. Would just be nice to have a solution one way or another with regard to transactional email providers vs traditional SMTP. I can live off my fork as I have for the last 2 years. All the best!

@hartwork
Contributor

@volksman can #91 be closed as well? I'm happy that you contribute. Current #127 is not thought to an end yet and would not work well when merged. Let's make #127 work and work well and then get someone with permissions to merge it. Everyone will be happy.

@volksman
Contributor

#91 can be closed off as it will never work anymore, however the solution is still sound and works great for my needs. That is my current implementation in production on a couple sites.

@hartwork
Contributor

hartwork commented Oct 14, 2020

I see. @volksman could you rebase #91 in place to fix its conflicts with master?
@spookylukey as of today, would you rather see the approach in #91 merged or the approach in #127 ? Who can we turn to who has merge permissions?

@spookylukey
Contributor

@volksman @hartwork
Thanks so much for your work on this, and your patience over a long period of time. There isn't anyone else apart from me with merge permissions, apart from maybe Pinax people, but they don't contribute here. I confess I'm seriously lacking time and energy for open source work in general, and I'm struggling to remember all the backwards and forwards on this issue. So here's what I'll do:

  • I'd like to see a clear section in the docs about the whole error handling strategy, with clear docs about the new settings. Also a note in the changelog, with particular attention to any backwards incompatibilities.
  • I'll defer to your judgement since you've been doing all the hard work on this, and some solution would be better than what we have at the moment. So, once you two have consensus that the work is done, I'll merge the PR.

@volksman
Contributor

volksman commented Oct 15, 2020

I see. @volksman could you rebase #91 in place to fix its conflicts with master?

Not able to rebase as I destroyed my fork and GitHub doesn't allow you to replace it in the PR.

I have added a small paragraph explaining the optional settings key and the optional override of the error handler. I don't believe there are any backwards compatibility issues.

I am now using the error_handler (#127 ) branch on one of my production servers without any changes (not overriding the error handler).

@hartwork
Contributor

hartwork commented Oct 17, 2020

@spookylukey sounds like a plan — thank you! Can you close #91 and #85 with a hint on #127 maybe, for us?

spookylukey added a commit that referenced this issue Nov 6, 2020
…or-handler

Introduce delivery error handler (related to #73 and #127)
@spookylukey
Contributor

Fixed via #133
