Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error sending message in campaign X: subscriber Y: timed out waiting for free conn in pool #734

Closed
nkosi23 opened this issue Mar 2, 2022 · 6 comments
Labels
needs-investigation Potential bug. Needs investigation

Comments

@nkosi23
Copy link

nkosi23 commented Mar 2, 2022

Version:

  • listmonk: [eg: v2.0.0]
  • OS: FreeBSD

Description of the bug and steps to reproduce:
After a number of emails are sent (about 100,000 or a couple of 100,000s), while the campaign continues to run and the emails sent counter continues to increase, no email is actually sent and logs show the error message:

2022/03/02 05:09:50 error sending message in campaign X: subscriber 4e66058f-f2b7-4cc9-af7b-c2bb4db32305: timed out waiting for free conn in pool

The default behavior of listmonk was working fine: because of this the campaign used to stop because too many such errors were produced but we have intentionally disabled this behavior because we have large subscriber lists and transient errors used to hit the threshold and suspend our campaigns too often.

I initially thought that the problem came from the SMTP Server and that maybe it exhausted its connection pool, however it turns out that when the problem happens the SMTP Server has no load and no incoming connection related to SMTP: listmonk's failure seems to happen before even contacting the SMTP Server.

I've therefore looked into increasing the size of Postgres connection pool assuming that the error could actually come from the database. However Postgres shows that it has no active connections from listmonk (sometimes 1 or 2 related to database requests made by the pages of listmonk's dashboard) so this is unlikely to be the issue.

The problem seem to go away on its own after waiting and trying to restart the campaign a few times. To clarify pausing and restarting the campaign does not help, but it looks like as some point, (a couple minutes - but this can be way more than 10 minutes I feel) the campaign works again when it is restarted.

For example, when I decided to seriously look into solving this recurring problem yesterday over the course of maybe 30 minutes to hour, maybe even more, I've tried to restart the campaign without success (the campaign does restart on listmonk but just produces the aforementioned error message in the log). So it can actually last for quite a while. Letting the campaign run does not make the problem disappear on its own either (when it happens this error is not transient).

Restarting listmonk through the dashboard (by changing some settings and clicking the restart button) does not solve the issue. However as I was typing this message, I noticed that this action does not actually restart the main listmonk process (I could see it had an uptime of 31 days days), so I just tried restarting the system process and the campaign is now working again.

I will need to confirm that this fix works by trying to restart the system process if the problem happens again. The issue does not seem to come from the environment since all systems indicators are fine (number of connections etc...). The environment was my main assumption since listmonk is in a container behind a nat, and the database is on a different virtual host etc... but after digging I haven't been able to find any exhaustion issue in the environment so it looks like an internal problem with listmonk.

@nkosi23 nkosi23 added the bug Something isn't working label Mar 2, 2022
@knadh
Copy link
Owner

knadh commented Mar 2, 2022

timed out waiting for free conn in pool happens when all the connections in the SMTP pool are busy sending messages (and none could return during the wait period). This can happen if:

  • There are too few connections in the pool.
  • The available connections are slow to send messages (slow network, slow SMTP). The default timeout is 5s, but it should ideally never take 5 seconds (or even 1 second) to send a single message.

Please share your settings:
image

If you're sending a large number of e-mails, your max. connections settings should most likely be in the 10s, or even 100s, depending on the capacity of your upstream SMTP.

@knadh knadh added needs-investigation Potential bug. Needs investigation and removed bug Something isn't working labels Mar 2, 2022
@nkosi23
Copy link
Author

nkosi23 commented Mar 2, 2022

Thanks a lot for your swift reply, the settings are:
image
Max connections used to be 200, but I decreased it as I suspected that maybe the SMTP server was being overloaded (very likely since it is a small virtual machine and the load is very high on it when listmonk sends these campaigns - 5 to 8 load average and the machine has a single core).

What do you suggest when there is a large number of messages to send but the SMTP Server is not powerful enough to have a large number of connections?

@nkosi23
Copy link
Author

nkosi23 commented Mar 2, 2022

(Beyond configuration listmonk seems to have issues recovering from this kind of failure as it is unable to resume sending even after the SMTP Server has no load whatsoever as explained in my message).

@knadh
Copy link
Owner

knadh commented Mar 2, 2022

however it turns out that when the problem happens the SMTP Server has no load and no incoming connection related to SMTP: listmonk's failure seems to happen before even contacting the SMTP Server.

Your settings look fine. This behaviour seems to be at the network level then. Firewall/proxy/iptables/Docker network? listmonk just opens standard TCP connections to the SMTP server. If those connections are timing out where each message takes ~5 seconds, constantly keeping the pool locked so that it's never cleared (connections are cleared when they're idle for ~15 seconds), then it's something to do with the network.

With slow connections, you should immediately see the max allowed 25 connections all attempting to send messages. Here, outgoing connections aren't even being established.

@nkosi23
Copy link
Author

nkosi23 commented Mar 2, 2022

This was my thinking also but as explained in my post there is no problem at the network level. I have spent a lot of time inspecting network traffic on various hosts, checking the internal state of firewalls, and so on... This is not where the problem is coming from.

Moreover the problem has happened again and I can now 100% confirm that it is fixed by restarting the listmonk system process. So there is somehow an issue with connection management in listmonk. Apparently listmonk is unable to recover after there have been transient "timed out waiting for free conn in pool" errors.

@nkosi23
Copy link
Author

nkosi23 commented Mar 3, 2022

I have increased the waiting time of the SMTP settings as per the below:

image

And so far the problem has not appeared again yet. I believe the very high pressure caused by the overloaded server and the short waiting time caused subtle deadlocks in smtppool.

@knadh knadh closed this as completed Mar 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-investigation Potential bug. Needs investigation
Projects
None yet
Development

No branches or pull requests

2 participants