Logging makes actor hang #32
Comments
Hi @andatt, Sorry for the delay: just got back from vacation. There are two parts to logging: each individual actor log statement causes a message to be sent to the Logger actor (managed automatically by the system), and the Logger actor then performs the actual logging action to files. If you are getting some logging, then it sounds like both parts generally work, so there must be some system limit or trigger within the docker environment that is causing things to stop. One consideration is that if the Logger does not process log messages, they build up in its mailbox, and eventually the regular actor will hang on a logging call (similar to the way it would hang on any other blocked send).

Two things you might look at for additional hints would be the journalctl/syslog information from the docker container itself, and the internal logging provided by Thespian itself (the thespian.log file).
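For reference, the logging pattern in question is just ordinary Python logging from inside an actor; a minimal sketch (the actor and message names are made up, assuming the multiprocTCPBase system base):

```python
import logging
from thespian.actors import Actor, ActorSystem

class Worker(Actor):
    def receiveMessage(self, message, sender):
        # Each logging call becomes a message sent to the system's Logger
        # actor, which then writes it out according to the logging config.
        logging.getLogger(__name__).info("received %r", message)
        self.send(sender, 'done')

if __name__ == '__main__':
    asys = ActorSystem('multiprocTCPBase')
    try:
        print(asys.ask(asys.createActor(Worker), 'hello', 5))
    finally:
        asys.shutdown()
```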
Hi Kevin, Thanks for the reply, no problem at all. Journalctl says no journal files found. I have checked /var/log for any other useful logfiles but cannot find any; the contents of /var/log is:
The entire contents of the thespian log you describe is:
That does seem to indicate things are locking up. I have had a look at all logfiles in /var/log; the docker logging itself just outputs the print statements from the actor. These of course cease when the hang takes place. Any ideas?

Edit 1: There is plenty of disk space in the container, so that doesn't seem to be the cause of the issue. Also, these are the processes running when I exec into the container while the hang is ongoing:
I changed the names of the actors so it's a bit clearer. Not sure if that sheds any more light; it doesn't to me.
The two lines in the thespian.log show that process 82 (Hanging Actor) is getting into the blocked state trying to send messages (950 are queued, and it won't start sending again until it drops below 780 queued). The second line shows that the MultiProcAdmin is trying to send something to an actor that isn't responding (very probably the Hanging Actor, but there's not enough info to correlate port numbers to processes yet).

You can try adjusting the Thespian internal logging settings to get more detail. You might also need to check host logfiles or update the docker logging configuration (https://docs.docker.com/config/containers/logging/configure/) to get access to useful logs; I'm thinking these may be useful for general information about what is happening in the docker container.

I'm also suspicious that there is no Thespian logging happening, because the logger process shows 0:00 cpu time, whereas the Hanging Actor shows considerable usage, and handling enough messages that at least 950 are queued should have used some cpu time. You may want to experiment with explicitly configuring the Thespian logging and running with minimal log output to see if it is generating anything at all in the docker container.

Your post above is also helpful in that it shows your docker configuration is pretty simple, so I will try to do some local Thespian+docker testing over the next day or two to see if I can reproduce your issue and get any additional insight on my end.
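One way to make the internal logging easier to see is to redirect it somewhere known before the system starts; a rough sketch, assuming Thespian's THESPLOG_FILE / THESPLOG_THRESHOLD environment variables and a hypothetical /tmp path (treat both as assumptions, not something confirmed in this thread):

```python
import os

# Assumed knobs for Thespian's internal (thesplog) output; set them before
# the ActorSystem is created so the admin and actor processes inherit them.
os.environ['THESPLOG_FILE'] = '/tmp/thespian.log'   # hypothetical location
os.environ['THESPLOG_THRESHOLD'] = 'DEBUG'          # maximum internal detail

from thespian.actors import ActorSystem

asys = ActorSystem('multiprocTCPBase')
# ... create actors, reproduce the hang, then inspect /tmp/thespian.log ...
```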
I just re-read your original post and saw that you do have explicit logging configuration, so I would definitely be interested in whether anything was written to the log files you configured there.
Yes, there is log output in log_file_path_2. This is the entire contents:
This is much less than the expected output. So the hang seems to occur after these messages are generated.
Just to clarify the above further, what's missing is all the logging messages inside the for loop. So I believe the problem should be reproducible with something like:
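A minimal sketch of that kind of reproduction (the actor name and iteration count are made up; assuming multiprocTCPBase as the system base):

```python
import logging
from thespian.actors import Actor, ActorSystem

class LoopLogger(Actor):
    def receiveMessage(self, message, sender):
        if message == 'go':
            # A large burst of log statements inside a loop; the hang shows
            # up partway through once enough log messages have queued up.
            for i in range(20000):
                logging.getLogger(__name__).info("iteration %d", i)
            self.send(sender, 'finished')

if __name__ == '__main__':
    asys = ActorSystem('multiprocTCPBase')
    try:
        print(asys.ask(asys.createActor(LoopLogger), 'go', 300))
    finally:
        asys.shutdown()
```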
Hi @andatt, Thanks for your report and the additional information you provided. The problem was a deadlock bug in the asynchronous transmit overflow handling code, which should be fixed by d6f13a5. Please test this fix in your environment, and if it resolves the issue I will make a Thespian release to make it official. -Kevin
Hi Kevin, That's great, thanks! Just tested and the fix works - it runs through the loop with no trouble at all. Thanks for your efforts in investigating and fixing! :)
Thanks for confirming the fix. I've generated release 3.9.5, which is available from github (https://github.com/kquick/Thespian/releases/tag/thespian-3.9.5) and pypi.org (https://pypi.org/project/thespian/3.9.5/), to make this fix official.
Been having this issue intermittently for a while now. When there is a large amount of logging output, for example a logging message inside a for loop, then beyond a certain number of iterations the actor will hang.
I have confirmed the cause is the logging statement inside the for loop, as everything behaves as expected when the logging statement is removed. I tried adding additional log files in case there is some file locking issue. I am using the standard logging config as recommended in the docs:
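Roughly this shape, with placeholder file names rather than my real paths (a sketch of the dictConfig-style logDefs the Thespian docs describe, not my exact config):

```python
import logging
from thespian.actors import ActorSystem

# Sketch of a dictConfig-style logDefs; the file names are placeholders.
logcfg = {
    'version': 1,
    'formatters': {
        'normal': {'format': '%(asctime)s %(levelname)-8s %(name)s %(message)s'},
    },
    'handlers': {
        'file1': {'class': 'logging.FileHandler',
                  'filename': 'log_file_path.log',
                  'formatter': 'normal',
                  'level': logging.INFO},
        'file2': {'class': 'logging.FileHandler',
                  'filename': 'log_file_path_2.log',
                  'formatter': 'normal',
                  'level': logging.INFO},
    },
    'loggers': {
        '': {'handlers': ['file1', 'file2'], 'level': logging.INFO},
    },
}

asys = ActorSystem('multiprocTCPBase', logDefs=logcfg)
```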
Are there any known issues around logging that could be causing this?
Edit 1:
Some additional information - when exiting the hanging actor with control-c, the Thespian part of the traceback looks like:
Edit 2:
This actor system / actor is being run inside a docker container. I have run the same code outside docker and the issue does not occur. So the problem seems to be caused by some interaction between logging and docker.