Sending message to remote actor systems #24

Closed
davidhonfi opened this Issue Nov 1, 2016 · 18 comments

davidhonfi commented Nov 1, 2016

I just recently started using Thespian; its features are really great. However, I have just run into a limitation/bug that I cannot solve. My situation is the following.

I have multiple actor systems on different machines, including a dedicated system that has a coordinator actor (this system is the convention leader). I would like to distribute the address of the coordinator actor for some specific actors in other, remote actor systems (to be able to send back results of jobs). I am currently using the convention update system message for this purpose (the coordinator actor is subscribed to this).

My problem here is that I cannot send a message to the actor system that is just connected to the convention. Using the message handler of the coordinator I am sending a message to the remoteAdminAddress (received in the convention update message). However, this message is not being delivered to the remote, recently connected system (the remote waits for a message with ActorSystem.listen()).

Also note that the coordinator does not throw any exception, the sending finishes successfully. I have tried configuring admin routing as well, but it did not help.

kwquick (Contributor) commented Nov 1, 2016

I'm glad you are enjoying Thespian so far.

If I understand correctly, you have multiple systems and actors that started independently on those systems and now you are wanting to ensure that the actors on the remote systems have the address of the coordinator actor?

If this is the case, that is an unusual configuration. Typically, actors exist in a tree of parent/child relationships. It's possible for actors to communicate with any other actor provided the address, but the address is typically passed as part of a message.

You are getting the remoteAdminAddress from the convention update, but the address provided is that of the admin itself, which is running in the background and responsible for keeping everything running on the local system. The admin will not respond to arbitrary messages; only to system messages it knows about.

The customary usage model is that the actors on the remote systems are created as children of an actor on the central system; for example, the coordinator is responsible for starting the top-level actors on those remote systems and therefore they automatically know about each other. The convention notification and associated admin address are generally intended for bookkeeping purposes (for example, we use it to keep a database updated for which remote systems are "online" at any point in time), but actors generally attempt to maintain location independence by not specifically targeting systems (instead, the capabilities and requirements are used to drive the placement of new actors).

The multi-system examples (and specifically https://github.com/godaddy/Thespian/tree/master/examples/multi_system/act5) show an example of how this can be done. I realize this is a pretty high-level response; there are some compelling elements of the architecture I described above that might help drive your model, but I would like to also know more about your model to make sure Thespian supports a broad set of usage categories, so I'm happy to discuss further details on this if you would like.

A couple of other notes:

  • ActorSystem.listen() should only be used from non-actor code; actors themselves simply have their receiveMessage() methods invoked when incoming messages are received. I'm not sure how you are using it above, but that might be another issue in your usage.
  • The send() operates asynchronously. If the target is known to be dead, it will deliver the message to the dead letter handler (if any). If the message can be delivered to the target but causes the target actor to throw an exception twice, it will be sent back to the original actor in a PoisonMessage wrapper. There are retries with an increasing backoff period for non-responsive targets, culminating in a declaration that the target is dead and delivery of the message via dead letter routing. You should never get an exception from a send(), but it may result in one of the two above "failures" at a future point. In this case, the Admin probably just ignored the unrecognized message, so there was no indication to your actor of any problem.
  • Admin routing is usually used in conjunction with TXOnly for handling network topologies that have a firewall preventing bi-directional travel. If you don't have one, I would not recommend this mode as there is additional delay/overhead in routing all messages through the Admin.

I hope the above helps, and I would be happy to discuss any of the points in more detail as well as learn more about your usage model. Feedback on the documentation and examples is also welcomed.

-Kevin


davidhonfi commented Nov 2, 2016

It is getting clearer now, thanks. So, in my case the best and supported solution would be to create each top-level actor on the remote systems from the coordinator. This would result in a large distributed tree of actors. If that is the case, my only requirement is to have only one top-level ("host coordinator") actor on each remote actor system. How can I ensure that with capabilities only? More generally, how can I ensure that there is only one instance of a specific type of actor in an actor system? Or should I avoid this while designing the architecture?

The rest of the job is clear: the top-level local actors create child actors in their own remote systems, which are then indirectly connected to the coordinator (grandparent) and can send messages to it.


kwquick (Contributor) commented Nov 2, 2016

The need to have "only one top-level actor on each remote actor system" should be manageable by the coordinator: the coordinator registers for convention notifications and when it receives one, it can call createActor with requirements that match the capabilities of the new remote system. This implies that there is at least one unique capability on each started remote Thespian instance; this is typically provided by whatever startup mechanism is being used, and could be any value you choose: an IP address, a hostname, a UUID, etc. Because the coordinator (or its delegate) is creating the remote actors, it knows which systems have remote actors and can ensure that there is exactly one on each remote system.

We have found that the pattern of single responsibility is a big benefit when implementing an actor-based application or service. For example, if your coordinator is also acting as the entry point for requests that will be handled by the system, then it may be convenient to create a separate "registrar" actor that manages the existence of top level actors. The registrar's responsibilities are:

  • Register for convention notifications

  • Upon notification of a new remote system:

     * Create a new host coordinator actor on that system
    
     * Record the address of that remote actor in a local dictionary
    
     * Push that address to the main coordinator
    
     * Provide that address on-demand to any asker
    
  • Upon notification of the exit of a remote system:

     * Remove the host coordinator actor from the local dictionary
    
     * Send a notification (if needed) to the main coordinator
    
  • Upon notification of a child actor exit on the remote system:

     * Create a new host coordinator actor on that system
    
     * Send update notification (if needed) to the main coordinator
    

This allows the main coordinator to simply request the host coordinator address for any remote system on demand from the registrar, and pushes all responsibility for maintaining those remote host coordinators to the registrar.
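
The registrar's bookkeeping can be sketched independently of Thespian. In this illustration (all names are invented, not part of Thespian's API), the actual `createActor` call is abstracted as an injected callable so the one-host-coordinator-per-system invariant is easy to see:

```python
class HostRegistry:
    """Tracks exactly one host coordinator per remote system."""

    def __init__(self, create_host_coordinator):
        # create_host_coordinator would wrap something like
        # self.createActor(HostCoordinator, targetActorRequirements=...)
        self._create = create_host_coordinator
        self._hosts = {}   # system id -> host coordinator address

    def system_joined(self, system_id):
        # Idempotent: a second join notification reuses the existing actor.
        if system_id not in self._hosts:
            self._hosts[system_id] = self._create(system_id)
        return self._hosts[system_id]

    def system_left(self, system_id):
        self._hosts.pop(system_id, None)

    def lookup(self, system_id):
        # On-demand answer for the main coordinator (or any asker).
        return self._hosts.get(system_id)
```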

I don't think you should avoid this architecture, and in fact we have implemented something quite similar for the set of actors responsible for remote system management and monitoring. We also have other patterns active (simultaneously) where actors get placed based on other capabilities (for example, valid network access to a remote service, credentials for accessing a database, etc.), and there is work in progress to allow load-based automatic actor distribution which would allow the least-heavily utilized remote to be chosen when creating a new actor.

-Kevin




davidhonfi commented Nov 7, 2016

Works as expected, thank you very much!

Now, I have only one question left. On the host of the Coordinator, I am shutting down the actor system using a "reaper pool" that includes every actor managed by the specific HostCoordinator. When the reaper pool is empty, the HostCoordinator sends a message to its own ActorSystem to shut itself down (its address was saved during initialization).

I would also like to have this process on the remote systems. However, when the HostCoordinators are created by the Registrar, this method cannot be used because I cannot reach the address of the remotely hosting ActorSystem (i.e., a remote HostCoordinator cannot send a message to its own ActorSystem to shut down).

Is there a way to automatically shut down remote actor systems when their actors have finished their jobs? Or should I create a separate process that monitors the ActorSystem for active actors?


kwquick (Contributor) commented Nov 7, 2016

Glad to hear things are progressing well for you!

We have customarily treated the Actor System itself as a background service (similar to sshd or inetd), but I can see the value in a cleanup-on-completion scenario. I'm curious as to how you are starting the Actor System on the remotes; if you are using something like systemd or upstart, then it may be most appropriate to utilize the same mechanism to effect the shutdown. One thing that comes to mind is that when your reaper has decided it is time to shutdown the remote, it can create an actor on the remote and send it a message that causes that actor to initiate the shutdown (via whatever means) on that remote system (local to it); there may be some timeouts and other abruption issues to deal with in this method.

Also please be aware that shutting down an actor will automatically and recursively send shutdown requests to its children, whether they are running locally or remotely, and also that shutting down an actor system will shutdown all actors running in that system.

-Kevin




davidhonfi commented Nov 7, 2016

I was also trying this idea: when it is time to shut down a remote actor system, its host coordinator gets notified and tells its system to shut itself down. However, I am getting a retryable exception when the host coordinator calls the ActorSystem().shutdown() method in receiveMessage(). Moreover, I tried invoking shutdown() when the HostCoordinator gets an ActorExitRequest, but nothing happens in this case (I think a deadlock occurs, as shutdown() waits for all actors to finish their ActorExitRequest receive functions [including the caller HostCoordinator]).

Are there any proper ways to shut down a local actor system from one of its actors?


kwquick (Contributor) commented Nov 7, 2016

It's not something that we've done before, but it's not unreasonable to support something like this. I would have expected the ActorSystem().shutdown() to work, so I'll investigate this and get back to you.




davidhonfi commented Nov 8, 2016

Thank you for the fast reply; I'm looking forward to your response.


kwquick (Contributor) commented Nov 8, 2016

Hi David,

I apologize for the mixup: I had sent you a reply, but had done it by email "reply" to a message in the middle of the chain rather than to the last message, not realizing that GitHub would ignore a reply that didn't occur at the end. I'm posting my original reply below directly via GitHub (and you can ignore the original if it ever pops out of the ether):

---- original message follows ----

I've done some testing and while it's sub-optimal, it does basically work for me. I've included a little test utility below that I used; feel free to update this to match your usage scenario.

The utility also has two ways to trigger the shutdown (one line is commented out, feel free to uncomment it and comment out the following line).

  • If the ActorSystem.shutdown() is initiated from a regular message, the ActorSystem tries to send an ActorExitRequest to the Actor, which it cannot respond to because the shutdown() is a blocking call (normally, an Actor should never use ActorSystem() methods; those are only intended for applications outside of the actor framework... more on this below). The ActorSystem will eventually time out and shutdown, at which point the call returns and the Actor can run and complete.
  • If the ActorSystem.shutdown() is initiated from the ActorSystemShutdown, a similar effect occurs, although if it wasn't blocked, it would continue looping forever.

As I stated above, this actually worked, but it's not desirable because it's not very clean, and it does violate the admonition to not call ActorSystem methods from an Actor itself. A better solution would be to provide a ".shutdownSystem()" method on the Actor itself, but I want to synchronize on the behaviour of the above methods first to ensure that I've captured the correct functionality. Please let me know what happens when you run the above, and send me the contents of ${TMPDIR}/thespian.log for the period of the run.

Once we've synchronized the behavior we are seeing from the test utility, I think adding the .shutdownSystem() method to the Actor itself is probably reasonable. One thing to be aware of is that it would operate asynchronously, so the call would return to the Actor's receiveMessage() handler and other messages may be delivered (including an ActorExitRequest) before the ActorSystem actually shuts down.

-Kevin

----snip----

from thespian.actors import *
import time
from datetime import timedelta

ask_wait = timedelta(milliseconds=350)

class Killer(Actor):
    def receiveMessage(self, msg, sender):
        if msg == 'ready?':
            self.send(sender, 'I am ready')
        elif msg == 'kill system':
            # Trigger 1: shutdown initiated from a regular message
            asys = ActorSystem(systemBase='multiprocTCPBase')
            asys.shutdown()
        elif isinstance(msg, ActorExitRequest):
            # Trigger 2: shutdown initiated from the ActorExitRequest handler
            asys = ActorSystem(systemBase='multiprocTCPBase')
            asys.shutdown()


def test_actorShutdownSystem():
    asys = ActorSystem(systemBase='multiprocTCPBase')
    killer = asys.createActor(Killer)

    check = asys.ask(killer, 'ready?', ask_wait)
    assert check == 'I am ready'

    # Uncomment one of the two lines below to choose the shutdown trigger:
    #asys.tell(killer, 'kill system')
    asys.tell(killer, ActorExitRequest())
    time.sleep(2)
    r = asys.ask(killer, 'ready?', ask_wait)
    assert r is None


if __name__ == "__main__":
    test_actorShutdownSystem()
kwquick (Contributor) commented Nov 8, 2016
davidhonfi commented Nov 8, 2016

First, I modified the snippet a little: in the Killer actor, a new instance of the ActorSystem was being created on the same address (due to the systemBase argument), which caused an exception. I therefore removed the systemBase argument from both constructor invocations in the Killer actor.

Finally, I got the attached results.

This is the same situation that I ran into in my own system. I'm looking forward to a shutdownSystem method on the actor class (or on a special, dedicated actor class).


kwquick (Contributor) commented Nov 9, 2016

Hi David,

You are getting what I would expect to see after your modifications, although your modifications didn't work quite as expected. I'm really interested in the exception you were getting with the original snippet above, because I don't get an exception and I would like to understand what is happening differently for you. I'm working on the self.shutdownSystem() method for an actor for you, but your exception is unexpected and concerns me.

Details regarding your results (feel free to skip):

By default, the ActorSystem() call creates a process-level singleton (briefly mentioned here, but I can see by your situation that I could explain this a little better: http://thespianpy.com/using.html#sec-6-2). Once the singleton has been created, any other calls to ActorSystem() will essentially ignore their arguments and use the previously-created singleton instance, which is what you were expecting in your tests when you removed the systemBase arguments in the killer actor.

Unfortunately, the killer actor is created as a separate process (because the test_actorShutdownSystem function created a "multiproc..." actor system base), and the killer actor's process does not have the same singleton instance. In the form of the test where the killer does not specify a systemBase, since it does not have visibility to the singleton, it creates a new Actor System, which defaults to the simpleSystemBase (http://thespianpy.com/using.html#sec-8-2). Because the simpleSystemBase runs in the context of the current process instead of creating separate processes, it ran in the context of the killer actor and tried to initialize logging in a manner that is not supported in actors, because actors use log forwarding in a multiprocess configuration (http://thespianpy.com/using.html#sec-4-12). This is the cause of the exception in both test cases.
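
The process-level singleton behavior can be illustrated with a minimal stand-in (this is NOT Thespian's actual implementation, just a sketch of the pattern described above): after the first construction, later calls ignore their arguments and reuse the existing instance. Since the singleton is per-process, the killer actor's separate process never sees the instance created in the test function's process.

```python
# Minimal stand-in (NOT Thespian's actual code) illustrating the
# process-level singleton behavior of ActorSystem(): the first call
# creates the instance; later calls in the same process ignore their
# arguments and return the instance created earlier.
_singleton = None

class FakeActorSystem:
    def __new__(cls, systemBase=None):
        global _singleton
        if _singleton is None:
            _singleton = super().__new__(cls)
            # arguments only take effect on this first call
            _singleton.systemBase = systemBase or 'simpleSystemBase'
        return _singleton

first = FakeActorSystem(systemBase='multiprocTCPBase')
second = FakeActorSystem()   # arguments ignored; same instance returned
```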

The processes you had leftover were:

  • Case 1: the multiprocTCPBase admin and the logger (the killer actor is not present because the ActorExitRequest processing killed it). No shutdown of the admin and the logger was requested because the exception occurred before that point.
  • Case 2: the multiprocTCPBase admin, the logger, and the killer actor were left over. The first two remain for the same reason as above, and the killer because no ActorExitRequest was sent to it.

BTW, if you install the setproctitle Python package, Thespian will automatically detect it and modify the process names under Linux. Here's an example from Case 2, running your version of the test with setproctitle installed:

$ ps -ef | grep -i [a]ctorad                                                                   
kquick   24631     1  0 17:20 ?        00:00:00 MultiProcAdmin ActorAddr-(TCP|192.168.56.9:1900)
kquick   24632 24631  0 17:20 ?        00:00:00 logger ActorAddr-(TCP|192.168.56.9:42776)
kquick   24633 24631  0 17:20 ?        00:00:00 __main__.Killer ActorAddr-(TCP|192.168.56.9:41124)

If you run the original version of the tests above with the systemBase specified in the killer actor, then that should start creating a new multiprocTCPBase system, which will look for an existing multiprocTCPBase admin; in this case it should find the original one created and simply use that, thereby operating with the system-wide "singleton" instance. This is normal and expected behavior, and the fact that you got an exception instead is concerning.

Here's the thespian.log output for me running case 2 with the systemBase specified in the killer actor:

2016-11-08 18:01:02.761269 p24717 I    ++++ Admin started @ ActorAddr-(TCP|192.168.56.9:1900) / gen (3, 4)
2016-11-08 18:01:02.771972 p24717 I    Pending Actor request received for __main__.Killer reqs None from ActorAddr-(TCP|192.168.56.9:37837)
2016-11-08 18:01:02.775875 p24719 I    Starting Actor __main__.Killer at ActorAddr-(TCP|192.168.56.9:41607) (parent ActorAddr-(TCP|192.168.56.9:1900), admin ActorAddr-(TCP|192.168.56.9:1900))
2016-11-08 18:01:02.790514 p24719 I    ActorSystem shutdown requested.
2016-11-08 18:01:12.799831 p24719 ERR  No response to Admin shutdown request; Actor system not completely shutdown
2016-11-08 18:01:12.799935 p24719 I    ActorSystem shutdown complete.
2016-11-08 18:01:23.862602 p24719 I    Handling exception on msg "ActorExitRequest": ActorAddr-(TCP|192.168.56.9:1900) is not a valid or useable ActorSystem Admin
2016-11-08 18:01:24.043384 p24717 I    ---- shutdown completed
2016-11-08 18:01:24.043704 p24717 ERR  ConnRefused to ActorAddr-(TCP|192.168.56.9:42397); declaring as DeadTarget.
2016-11-08 18:01:24.043866 p24717 I    completion error: ************* TransportIntent(ActorAddr-(TCP|192.168.56.9:42397)-=-SENDSTS_DEADTARGET-<class 'thespian.system.messages.admin.SystemShutdownCompleted'>-<thespian.system.messages.admin.SystemShutdownCompleted object at 0x7f37bdec1898>-quit_0:04:59.999003)

It's notable that there is a 10-second delay before the first ERR line: this is how long an ActorSystem will gracefully wait for all actors to stop before giving up and exiting with this error. There is another 12-second delay before the TCP connection fails during the shutdown process and the final cleanup code runs. After this point there should be no more processes running, but during the 22-second period from the start of the run there will still be processes if you look. I would also be concerned if you don't see this final cleanup performed and there are still processes running 30 seconds after starting the test.

Much of the above is pretty complicated and the normal expectation is that the developer writing an actor application using Thespian should not need to know about this level of intricacy (and shouldn't normally need to consult the thespian.log, which is for debugging the internals of Thespian). The proper solution in this case is to provide a self.shutdownSystem() method for actors to call for your previously unforeseen use case, rather than calling ActorSystem() or any method thereof from an actor. I'm working on that shutdownSystem method for you, but the behavior you are describing for the sample code above with systemBase specifications is unexpected, so I would like to address that as well.


davidhonfi commented Nov 9, 2016

Hi Kevin,

First of all, I am using a Windows 7 environment with Python 3.4.4. In my case, when the systemBase argument is given to the ActorSystem constructor, it also tries to create another system on the same port (see the attached log of running case 2 with systemBase arguments), which causes an exception (InvalidActorAddress for the admin actor).

case2_w_systemBase.txt

If you need further tests, let me know and I'll try it. I'm also really looking forward to the self.shutdownSystem() method.


kwquick added a commit that referenced this issue Nov 10, 2016

kwquick (Contributor) commented Nov 10, 2016

David,

Thanks for the additional info: this did reveal a Windows-related bug that should be fixed now in b397ce6.

In addition, I've uploaded the self.actorSystemShutdown() modification (7a9e0b4). Please let me know if this works for you and if so I'll include it in the next release.

Thanks,
Kevin
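
For reference, here is a minimal sketch of how the Killer actor from the earlier test utility could use the new actorSystemShutdown() call instead of ActorSystem() methods. The base class below is a hypothetical stand-in so the sketch is self-contained; in real code Killer would subclass thespian.actors.Actor, which provides the method as of commit 7a9e0b4.

```python
# FakeActorBase is a hypothetical stand-in for thespian.actors.Actor,
# used only to make this sketch self-contained.
class FakeActorBase:
    def __init__(self):
        self.shutdown_requested = False

    def actorSystemShutdown(self):
        # In real Thespian this is asynchronous: it returns immediately,
        # and more messages (including an ActorExitRequest) may still be
        # delivered before the ActorSystem actually shuts down.
        self.shutdown_requested = True

    def send(self, target, msg):
        pass  # the real implementation queues msg for delivery to target


class Killer(FakeActorBase):
    def receiveMessage(self, msg, sender):
        if msg == 'ready?':
            self.send(sender, 'I am ready')
        elif msg == 'kill system':
            # no ActorSystem() call from inside the actor anymore
            self.actorSystemShutdown()
```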


davidhonfi commented Nov 11, 2016

Hi Kevin,

Both the ActorSystem constructor and the actorSystemShutdown method work as expected! I really appreciate your help, thanks.


jnkramer3 commented Nov 30, 2016

David,

This thread started with the idea of connecting different remote actor systems, and you indicated that you got something working. I was wondering whether you have a simple example of this that you could share. I am trying to decide whether to pursue Thespian for a project I am considering; without this ability it's clearly a no-go, and a working collection of cooperating actor systems would go a long way toward helping me decide.

Any help would be appreciated.
Thanks.


davidhonfi commented Dec 1, 2016

Hi,

Exactly, I was able to get remote actors to communicate with each other using a convention. An example is provided in this directory.

The key idea is to have a convention leader actor system to which the remotes can connect. When this happens, a special actor in the leader actor system can receive notifications about it (using notifyOnSystemRegistrationChanges and handling the ActorSystemConventionUpdate messages; see the documentation). If you want to create a new actor on the newly joined remote actor system, you should use actorSystemCapabilityCheck with a unique identifier for each remote actor system (e.g., its IP or a unique name).
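
The capability-matching idea can be sketched as follows. In real code, actorSystemCapabilityCheck is a @staticmethod defined on a thespian.actors.Actor subclass, which Thespian consults when deciding on which convention member the actor may be created; the 'node-id' capability name and matching rule below are hypothetical examples, not part of Thespian itself.

```python
# Hypothetical example of the capability check described above. In real
# code this function would be a @staticmethod on a thespian.actors.Actor
# subclass; the 'node-id' capability name is an illustrative choice.
def actorSystemCapabilityCheck(capabilities, requirements):
    # Only allow this actor to run on the actor system whose unique
    # identifier matches the one the creator asked for.
    wanted = requirements.get('node-id')
    return wanted is not None and capabilities.get('node-id') == wanted

# The coordinator would then pass the identifier as a requirement, e.g.
#   createActor(Worker, targetActorRequirements={'node-id': 'node-a'})
# and each remote system would be started with {'node-id': ...} in its
# declared capabilities.
```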


kquick commented Dec 5, 2016

@jnkramer3 - hopefully David's response above is helpful for your decision. I am closing this issue but please feel free to create a new issue or reach out on the Thespian mailing list (https://groups.google.com/forum/#!forum/thespianpy) if you would like further help in your analysis or design.


kwquick (Contributor) commented Dec 5, 2016

The new self.actorSystemShutdown() has been included in the latest Thespian 3.5.0 release (https://github.com/godaddy/Thespian/releases/tag/thespian-3.5.0); closing this issue.


kwquick closed this Dec 5, 2016
