
Logging makes actor hang #32

Closed
andatt opened this issue Dec 3, 2018 · 9 comments

@andatt

andatt commented Dec 3, 2018

I've been having this issue intermittently for a while now. When there is a large amount of logging output, for example a logging message inside a for loop, the actor hangs after a certain number of iterations.

I have confirmed the cause is the logging statement inside the for loop, as everything behaves as expected when the logging statement is removed. I tried adding additional log files in case there is some file-locking issue. I am using the standard logging config as recommended in the docs:

import logging


class ActorLogFilter(logging.Filter):
    """Pass only records that come from an actor (they carry an actorAddress)."""
    def filter(self, logrecord):
        return 'actorAddress' in logrecord.__dict__


class NotActorLogFilter(logging.Filter):
    """Pass only records that do not come from an actor."""
    def filter(self, logrecord):
        return 'actorAddress' not in logrecord.__dict__


def log_config(log_file_path_1, log_file_path_2):
    return {
        'version': 1,
        'formatters': {
            'normal': {'format': '%(levelname)-8s %(message)s'},
            'actor': {'format': '%(levelname)-8s %(actorAddress)s => %(message)s'}},
        'filters': {'isActorLog': {'()': ActorLogFilter},
                    'notActorLog': {'()': NotActorLogFilter}},
        'handlers': {'h1': {'class': 'logging.FileHandler',
                            'filename': log_file_path_1,
                            'formatter': 'normal',
                            'filters': ['notActorLog'],
                            'level': logging.INFO},
                     'h2': {'class': 'logging.FileHandler',
                            'filename': log_file_path_2,
                            'formatter': 'actor',
                            'filters': ['isActorLog'],
                            'level': logging.INFO}, },
        'loggers': {'': {'handlers': ['h1', 'h2'], 'level': logging.DEBUG}}
    }
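For reference, the config above gets passed to the actor system roughly like this (a sketch rather than my exact test harness; the file paths and the system base name are just examples of what I use):

from thespian.actors import ActorSystem

# Example startup only: file paths and the system base name are placeholders.
actor_system = ActorSystem('multiprocTCPBase',
                           logDefs=log_config('/tmp/normal.log', '/tmp/actor.log'))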

Are there any known issues around logging that could be causing this?

Edit 1:

Some additional information - when exiting the hanging actor with control-c, the Thespian part of the traceback looks like:

  File "/app/tests/testing_helpers.py", line 505, in wrapper
    return wrapped_func(*args, **kwargs)
  File "/app/tests/system/system_test_structure.py", line 56, in _get_test_results
    actor_system.ask(defd_actor, setup_message)
  File "/usr/local/lib/python3.5/site-packages/thespian/actors.py", line 736, in ask
    return self._systemBase.ask(actorAddr, msg, timeout)
  File "/usr/local/lib/python3.5/site-packages/thespian/system/systemBase.py", line 264, in ask
    response = self._run_transport(remTime.remaining())
  File "/usr/local/lib/python3.5/site-packages/thespian/system/systemBase.py", line 140, in _run_transport
    max_runtime.view().remaining())
  File "/usr/local/lib/python3.5/site-packages/thespian/system/transport/wakeupTransportBase.py", line 71, in run
    rval = self._run_subtransport(incomingHandler)
  File "/usr/local/lib/python3.5/site-packages/thespian/system/transport/wakeupTransportBase.py", line 80, in _run_subtransport
    rval = self._runWithExpiry(incomingHandler)
  File "/usr/local/lib/python3.5/site-packages/thespian/system/transport/TCPTransport.py", line 1140, in _runWithExpiry
    set(wsend+wrecv), delay)

Edit 2:

This actor system / actor is being run inside a docker container. I have run the same code outside docker and the issue does not occur. So the problem seems to be caused by some interaction between logging and docker.

@kquick
Owner

kquick commented Dec 12, 2018

Hi @andatt,

Sorry for the delay: just got back from vacation.

There are two parts to logging: each individual actor log statement causes a message to be sent to the Logger actor (managed automatically by the system), and the Logger actor then performs the actual logging action to files. If you are getting some logging, then it sounds like both parts generally work, so there must be some system limit or trigger within the docker system that is causing things to stop.

One consideration is that if the logger does not process log messages then they build up in the mailbox and eventually the regular actor will hang on a logging call (similar to the way it would hang on a self.send(...) if the recipient was not processing messages) until the logger starts reading messages again. You might check to see if you are running out of disk space for writing logs in the docker environment, and that there is nothing in the docker environment that would decide the logger was not operating correctly and needed to be killed (e.g. the OOM killer or something like that).

Two things you might look at for additional hints are the journalctl/syslog information from the docker container itself, and the internal logging provided by the thespian/system/utilis.py file. That file has functionality used to debug Thespian itself and writes internal logging to $THESPLOG_FILE (defaulting to $TMPDIR/thespian.log), limiting the file to $THESPLOG_FILE_MAXSIZE (defaulting to 50KB) with a logging level of $THESPLOG_THRESHOLD (defaulting to WARNING). If you would like to post a gist of this latter log, I'd be happy to take a look and see if there is anything in it that would reveal what your issue is.
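For example, something like this sketch (it assumes the variables are set before the actor system is created from that process, so the Thespian child processes inherit them; paths and sizes are just examples):

import os

# Raise Thespian's internal logging verbosity; set these before the actor
# system (and therefore its child processes) is started.
os.environ['THESPLOG_THRESHOLD'] = 'Debug'               # default: WARNING
os.environ['THESPLOG_FILE'] = '/tmp/thespian_debug.log'  # default: $TMPDIR/thespian.log
os.environ['THESPLOG_FILE_MAXSIZE'] = str(512 * 1024)    # default: ~50KB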

@andatt
Author

andatt commented Dec 13, 2018

Hi Kevin

Thanks for the reply, no problem at all. Journalctl says no journal files found. I have checked /var/log for any other useful logfiles but cannot find any. The contents of /var/log are:

alternatives.log  dmesg             fontconfig.log    wtmp
apt/              dpkg.log          fsck/             
btmp              faillog           lastlog 

The entire contents of the thespian log you describe are:

2018-12-13 11:38:44.490334 p82 Warn Entering tx-only mode to drain excessive queue (950 > 950, drain-to 780)
2018-12-13 11:38:44.490334 p78 Warn Transmit attempt from ActorAddr-(T|:1900) to ActorAddr-(T|:44925) timed out, returning PoisonPacket

That does seem to indicate things are locking up. I have had a look at all the logfiles in /var/log; the docker logging itself just outputs the print statements from the actor. These of course cease when the hang takes place. Any ideas?

Edit 1:

There is plenty of disk space in the container so that doesn't seem to be the cause of the issue.

Also, these are the processes running when I exec into the container while the hang is ongoing:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.2 205088 48860 pts/0    Ssl+ 11:38   0:01 python -m unitt
root        76  0.0  0.0      0     0 ?        Zs   11:39   0:00 [python] <defun
root        78  0.0  0.2 209208 43892 ?        Sl   11:39   0:00 MultiProcAdmin 
root        80  0.0  0.2 209208 42480 ?        Sl   11:39   0:00 logger ActorAdd
root        82 13.4  0.2 214300 48352 ?        Sl   11:39  17:52 Hanging Actor
root        84  0.0  0.2 209208 42196 ?        Sl   11:39   0:00 A Global Actor
root       113  0.0  0.0  21956  3616 pts/1    Ss   12:05   0:00 bash
root       125  0.0  0.0  19192  2328 pts/1    R+   13:52   0:00 ps aux

I changed the names of the actors so it's a bit clearer. Not sure if that sheds any more light; it doesn't to me.

@kquick
Owner

kquick commented Dec 13, 2018

The first line in the thespian.log shows that process 82 (Hanging Actor) is getting into the blocked state trying to send messages (950 are queued, and it won't start sending again until it drops below 780 queued). The second line shows that the MultiProcAdmin is trying to send something to an actor that isn't responding (very probably the Hanging Actor, but there's not enough info to correlate port numbers to processes yet).

You can try setting the THESPLOG_THRESHOLD environment variable to 'Debug' (or maybe just 'Info') to get a little more information in that thespian.log file.

You might need to check host logfiles or update the docker logging configuration (https://docs.docker.com/config/containers/logging/configure/) to get access to useful logs. I'm thinking these may be useful for general information about what is happening in the docker container.

I'm also suspicious that there is no Thespian logging happening, because the logger process shows 0:00 cpu time, whereas the Hanging Actor shows considerable usage, and handling enough messages that at least 950 are queued should have used some cpu time. You may want to experiment with explicitly configuring the Thespian logging and running with minimal log output to see if it is generating anything at all in the docker container.

Your post above is also helpful in that it shows your docker configuration is pretty simple, so I will try to do some local Thespian+docker testing over the next day or two to see if I can reproduce your issue and get any additional insight on my end.

@kquick
Owner

kquick commented Dec 13, 2018

I just re-read your original post and saw that you do have an explicit logging configuration, so I would definitely be interested in whether anything was written to the log_file_path_1 and log_file_path_2 files with the minimal logging configuration.

@andatt
Author

andatt commented Dec 13, 2018

Yes, there is log output in log_file_path_2. This is the entire contents:

INFO     ActorAddr-(T|:35411) => SETTING UP ACTOR [name] ...
INFO     ActorAddr-(T|:35411) => ACTOR [name] SETUP COMPLETE
INFO     ActorAddr-(T|:35411) => [name] waking up at 2018-12-04 11:03:32...

This is much less than the expected output. So the hang seems to occur after these messages are generated.

@andatt
Author

andatt commented Dec 14, 2018

Just to clarify the above further, what's missing is all the logging messages inside the for loop. So I believe the problem should be reproducible by the following steps (a minimal sketch follows the list):

  1. creating an actor system inside a docker container.
  2. In that system there should be an actor which contains a for loop.
  3. Inside this for loop there should be a logging statement.
  4. There can also be a logging statement placed before the start of the for loop (to produce something similar to the log above)
  5. That for loop should have 1000+ iterations.
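Something like this sketch is what I have in mind (class and message names are made up for illustration; it just pairs a Thespian actor's receiveMessage with a logging call per iteration, and my real test also passes the logDefs config from my first post to ActorSystem):

import logging

from thespian.actors import Actor, ActorSystem


class LoopingActor(Actor):
    # Illustrative actor: one logging call per loop iteration is the
    # pattern that hangs for me inside docker.
    def receiveMessage(self, message, sender):
        if message == 'start':
            logging.info('SETTING UP ACTOR ...')
            for i in range(2000):
                logging.info('iteration %d', i)
            self.send(sender, 'done')


if __name__ == '__main__':
    # Run this inside the docker container.
    system = ActorSystem('multiprocTCPBase')
    actor = system.createActor(LoopingActor)
    print(system.ask(actor, 'start', 120))
    system.shutdown()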

@kquick
Owner

kquick commented Dec 17, 2018

Hi @andatt,

Thanks for your report and the additional information you provided. The problem was a deadlock bug in the asynchronous transmit overflow handling code, which should be fixed by d6f13a5. Please test this fix in your environment; if it resolves the issue, I will make a Thespian release to make it official.

-Kevin

@andatt
Author

andatt commented Dec 17, 2018

Hi Kevin

That's great, thanks! Just tested and the fix works: it runs through the loop with no trouble at all. Thanks for your efforts in investigating and fixing! :)

@kquick
Owner

kquick commented Dec 17, 2018

Thanks for confirming the fix. I've generated release 3.9.5 which is available from github (https://github.com/kquick/Thespian/releases/tag/thespian-3.9.5) and pypi.org (https://pypi.org/project/thespian/3.9.5/) to make this fix official.

@kquick kquick closed this as completed Dec 17, 2018