Runtime and ValueError in TCPTransport.py #35

Closed
davinellulinvega opened this Issue Feb 2, 2017 · 17 comments

Comments

Projects
None yet
2 participants
@davinellulinvega

Dear Sir,

First of all thank you very much for making thespian available to the public, it is a very nice and easy to use framework.

At the moment I am trying to implement a simple online game using both thespian (as the 'back-end') and Django (as the 'front-end'). In the game only 4 players can interact at a time. A django view has been set up to handle the interaction between the remote player and the back-end. Since django views are stateless, I have implemented a Lookup actor and initialised it as a named actor when launching thespian's ActorSystem. The lookup actor works as one might expect, it receives a special code attributed to each game, look for the corresponding ActorAddress and sends the address back to the view, so that the view can communicate with the game.
Game actors are created on demand and simply treat players' requests.
So far I have been able to create complete games and have different remote players connect to the game. However, once the game begins to play, the view encounters a Runtime Error, closely followed by a Value Error in the TCPTransport.py file.
I have attached the details of the error to this message, in hope that you could help me understand what is going on. Does the exception happen because of an implementation error or is it internal to Thespian?

Thank you very much in advance for your help.
Sincerely.

dj_err.txt

@kwquick

This comment has been minimized.

Show comment
Hide comment
@kwquick

kwquick Feb 2, 2017

Contributor

Thank you for this report and your kind remarks.

While I can't speak to the view's exception, it shouldn't cause those issues in Thespian's TCPTransport. I will try to fix this very soon, although I may need your assistance in validating the fix since nothing in the current test suite has revealed this issue. If you have any way of creating a simple scenario that reproduces this problem I would love to get that to help ensure the issues are resolved and introduct that into the test suite.

-Kevin

Contributor

kwquick commented Feb 2, 2017

Thank you for this report and your kind remarks.

While I can't speak to the view's exception, it shouldn't cause those issues in Thespian's TCPTransport. I will try to fix this very soon, although I may need your assistance in validating the fix since nothing in the current test suite has revealed this issue. If you have any way of creating a simple scenario that reproduces this problem I would love to get that to help ensure the issues are resolved and introduct that into the test suite.

-Kevin

@kwquick kwquick added the bug label Feb 2, 2017

@davinellulinvega

This comment has been minimized.

Show comment
Hide comment
@davinellulinvega

davinellulinvega Feb 2, 2017

Hello Kevin,

Thank you very much for answering in such a short amount of time.
Since my last message I have been able to run some more tests and it seems that the error is indeed internal to Thespian. Here are the two lines that appear in Thespian's logs:

2017-02-02 20:34:03.979934 p24239 ERR  ValueError on select(#7: [20, 21, 19, 9, 23, -1, 5], #1: [18], #8: {5, 9, 18, 19, 20, 21, 23, -1}, 0.018026)

2017-02-02 20:34:04.131806 p24239 ERR  Socket error sending to ActorAddr-(T|:1900) on <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=2049, proto=6>: [Errno 9] Bad file descriptor / 9: ************* TransportIntent(ActorAddr-(T|:1900)-pending-ExpiresIn_0:04:59.999747-<class 'thespian.system.messages.admin.PendingActor'>-PendingActor#25_of_Noneis"PlayerLookup"-quit_0:04:59.999742)

2017-02-02 20:34:04.131914 p24239 I    Error during shutdown of socket <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=2049, proto=6>: [Errno 9] Bad file descriptor

Could this be due to the fact that the views are constantly asking for the actor system to create named actors?

Concerning the creation of a simple scenario reproducing the problem, I would be happy to help, but since I do not understand exactly where the error happens I cannot provide you with any source code at the moment. If you have any explanation as to why this is happening, I might be able to send you a simplified version of my code.

Thank you very much in advance for your help.
Sincerely.

davinellulinvega commented Feb 2, 2017

Hello Kevin,

Thank you very much for answering in such a short amount of time.
Since my last message I have been able to run some more tests and it seems that the error is indeed internal to Thespian. Here are the two lines that appear in Thespian's logs:

2017-02-02 20:34:03.979934 p24239 ERR  ValueError on select(#7: [20, 21, 19, 9, 23, -1, 5], #1: [18], #8: {5, 9, 18, 19, 20, 21, 23, -1}, 0.018026)

2017-02-02 20:34:04.131806 p24239 ERR  Socket error sending to ActorAddr-(T|:1900) on <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=2049, proto=6>: [Errno 9] Bad file descriptor / 9: ************* TransportIntent(ActorAddr-(T|:1900)-pending-ExpiresIn_0:04:59.999747-<class 'thespian.system.messages.admin.PendingActor'>-PendingActor#25_of_Noneis"PlayerLookup"-quit_0:04:59.999742)

2017-02-02 20:34:04.131914 p24239 I    Error during shutdown of socket <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=2049, proto=6>: [Errno 9] Bad file descriptor

Could this be due to the fact that the views are constantly asking for the actor system to create named actors?

Concerning the creation of a simple scenario reproducing the problem, I would be happy to help, but since I do not understand exactly where the error happens I cannot provide you with any source code at the moment. If you have any explanation as to why this is happening, I might be able to send you a simplified version of my code.

Thank you very much in advance for your help.
Sincerely.

@kwquick

This comment has been minimized.

Show comment
Hide comment
@kwquick

kwquick Feb 2, 2017

Contributor

I have been able to reproduce your problem locally, so thank you but I don't need you to supply a sample application.

Your architecture should be fine (creating named actors), this is a bug with the actor system calls being inadequately protected from invocation by multiple threads.

Contributor

kwquick commented Feb 2, 2017

I have been able to reproduce your problem locally, so thank you but I don't need you to supply a sample application.

Your architecture should be fine (creating named actors), this is a bug with the actor system calls being inadequately protected from invocation by multiple threads.

@kwquick

This comment has been minimized.

Show comment
Hide comment
@kwquick

kwquick Feb 3, 2017

Contributor

Status:

  • Reproduce issue
  • New unit test to reproduce issue
  • Research and understand problem
  • Determine solution
  • Implement solution fix
  • Regression testing
  • Bugfix release (ETA: by Saturday, MST)
Contributor

kwquick commented Feb 3, 2017

Status:

  • Reproduce issue
  • New unit test to reproduce issue
  • Research and understand problem
  • Determine solution
  • Implement solution fix
  • Regression testing
  • Bugfix release (ETA: by Saturday, MST)
@kwquick

This comment has been minimized.

Show comment
Hide comment
@kwquick

kwquick Feb 8, 2017

Contributor

Just an update: the fix is being worked on. The proper solution involves some enhancements in handling multi-threaded external environments which necessitate some careful updates and testing. I apologize for the delay, but hope to have this finished in the next couple of days.

Contributor

kwquick commented Feb 8, 2017

Just an update: the fix is being worked on. The proper solution involves some enhancements in handling multi-threaded external environments which necessitate some careful updates and testing. I apologize for the delay, but hope to have this finished in the next couple of days.

@davinellulinvega

This comment has been minimized.

Show comment
Hide comment
@davinellulinvega

davinellulinvega Feb 8, 2017

Thank you very much for the update.
And good luck for the implementation and testing.

Thank you very much for the update.
And good luck for the implementation and testing.

@davinellulinvega

This comment has been minimized.

Show comment
Hide comment
@davinellulinvega

davinellulinvega Feb 17, 2017

Hello Kevin,

Forgive me for asking so directly, but I would like to know in what state is the fix for the bug described here? The deadline for my project, using thespian, is drawing to a near and still have some testing of my own to do? How long do you expect the development to last?

Thank you very much for your help so forth.
Sincerely.

Hello Kevin,

Forgive me for asking so directly, but I would like to know in what state is the fix for the bug described here? The deadline for my project, using thespian, is drawing to a near and still have some testing of my own to do? How long do you expect the development to last?

Thank you very much for your help so forth.
Sincerely.

@kwquick

This comment has been minimized.

Show comment
Hide comment
@kwquick

kwquick Feb 18, 2017

Contributor

It's no problem to ask for an update, and I do apologize for the delay. Working through this issue cascaded into revealing two other problems in the code, and getting the flu didn't help the speed either. I believe I have everything mostly handled at this point, and am working through the last set of tests. I promise to give an update within 24 hours, and hopefully that update will take the form of a release.

The core issue is that you are using a multi-threaded client application to call the ask(), but the ask() framework is process-global; aside from the areas that had thread issues, the results would still not be very useful to you in the original form because there was no way to correlate the responses to the threads that generated the requests. In order to fix this, I have introduced new functionality for multi-threaded clients (non-multithreaded clients can operate just as before without change): the with statement can be used with the ActorSystem.thread_private returned context object to allow ActorSystem operations (ask(), tell(), and listen()) on that context distinctly and separately from other threads.

def client_init(...):
    asys = ActorSystem(...)

def thread_one_run(...):
    with asys.private() as thrd_asys:
        response1 = thrd_asys.ask(target_a, msg_1, ask_timeout)

def thread_two_run(...):
    with asys.private() as thrd_asys:
        response2 = thrd_asys.ask(target_b, msg_2, ask_timeout)

Using the above code, response1 will be the result of target_a sending back to its sender, and response2 will be the result of target_b sending back to its sender. This will even work if target_a and target_b are the same actor, because the private context has a unique local address.

This will necessitate a (hopefully very small) change to your code, but should give you the reliable functionality you are looking for. Please let me know if the above will work for you or if it will be problematic for you to work with.

Regards,
Kevin

Contributor

kwquick commented Feb 18, 2017

It's no problem to ask for an update, and I do apologize for the delay. Working through this issue cascaded into revealing two other problems in the code, and getting the flu didn't help the speed either. I believe I have everything mostly handled at this point, and am working through the last set of tests. I promise to give an update within 24 hours, and hopefully that update will take the form of a release.

The core issue is that you are using a multi-threaded client application to call the ask(), but the ask() framework is process-global; aside from the areas that had thread issues, the results would still not be very useful to you in the original form because there was no way to correlate the responses to the threads that generated the requests. In order to fix this, I have introduced new functionality for multi-threaded clients (non-multithreaded clients can operate just as before without change): the with statement can be used with the ActorSystem.thread_private returned context object to allow ActorSystem operations (ask(), tell(), and listen()) on that context distinctly and separately from other threads.

def client_init(...):
    asys = ActorSystem(...)

def thread_one_run(...):
    with asys.private() as thrd_asys:
        response1 = thrd_asys.ask(target_a, msg_1, ask_timeout)

def thread_two_run(...):
    with asys.private() as thrd_asys:
        response2 = thrd_asys.ask(target_b, msg_2, ask_timeout)

Using the above code, response1 will be the result of target_a sending back to its sender, and response2 will be the result of target_b sending back to its sender. This will even work if target_a and target_b are the same actor, because the private context has a unique local address.

This will necessitate a (hopefully very small) change to your code, but should give you the reliable functionality you are looking for. Please let me know if the above will work for you or if it will be problematic for you to work with.

Regards,
Kevin

@kwquick

This comment has been minimized.

Show comment
Hide comment
@kwquick

kwquick Feb 19, 2017

Contributor

Update: all changes are good except for the simple system base, testing looks good so far. I will create a branch within the next couple of hours for you to validate that this resolves your problem as well.

Contributor

kwquick commented Feb 19, 2017

Update: all changes are good except for the simple system base, testing looks good so far. I will create a branch within the next couple of hours for you to validate that this resolves your problem as well.

@kwquick

This comment has been minimized.

Show comment
Hide comment
@kwquick

kwquick Feb 20, 2017

Contributor

I have created a branch (https://github.com/godaddy/Thespian/tree/add_system_private) with the changes described above. Please take a look and let me know if this resolves your issue (using the .private() context as described in my previous update). If it does, I will generate a new release containing these changes.

Also note that I just released Thespian 3.6.2, which is a bugfix release that contains solutions for some of the issues that were revealed in the work on this .private() context support, but the 3.6.2 release does not contain the .private() context. Since that is a feature addition, that would go into a 3.7 release.

Regards,
Kevin

Contributor

kwquick commented Feb 20, 2017

I have created a branch (https://github.com/godaddy/Thespian/tree/add_system_private) with the changes described above. Please take a look and let me know if this resolves your issue (using the .private() context as described in my previous update). If it does, I will generate a new release containing these changes.

Also note that I just released Thespian 3.6.2, which is a bugfix release that contains solutions for some of the issues that were revealed in the work on this .private() context support, but the 3.6.2 release does not contain the .private() context. Since that is a feature addition, that would go into a 3.7 release.

Regards,
Kevin

@davinellulinvega

This comment has been minimized.

Show comment
Hide comment
@davinellulinvega

davinellulinvega Feb 20, 2017

Hello Kevin,

Thank you very much for the update. I will try solution you implemented in the new branch and let you know if it solves the problem, as soon as possible.
The changes introduced by the new .private() context should be minimal and testing is scheduled to be finished by the end of the day.

Sincerely.

Hello Kevin,

Thank you very much for the update. I will try solution you implemented in the new branch and let you know if it solves the problem, as soon as possible.
The changes introduced by the new .private() context should be minimal and testing is scheduled to be finished by the end of the day.

Sincerely.

@davinellulinvega

This comment has been minimized.

Show comment
Hide comment
@davinellulinvega

davinellulinvega Feb 20, 2017

Hello again,

I have made all the necessary changes for my implementation to use the .private() context inside the Django client (where the call to ask() is done).
After some testing, it seems everything is working well. I do not get any RuntimeError any more.
Thank you very much for your help on this issue.

Sincerely.

Hello again,

I have made all the necessary changes for my implementation to use the .private() context inside the Django client (where the call to ask() is done).
After some testing, it seems everything is working well. I do not get any RuntimeError any more.
Thank you very much for your help on this issue.

Sincerely.

@kwquick

This comment has been minimized.

Show comment
Hide comment
@kwquick

kwquick Feb 21, 2017

Contributor

Excellent news! I will create a release in the next couple of days with this officially supported.

Contributor

kwquick commented Feb 21, 2017

Excellent news! I will create a release in the next couple of days with this officially supported.

@davinellulinvega

This comment has been minimized.

Show comment
Hide comment
@davinellulinvega

davinellulinvega Feb 28, 2017

Hello Kevin,

After some more extensive testing and debugging on my side of the project, I can confirm that the solution you implemented works perfectly.
Once again, I thank you very much for your help on this issue.
If it is alright with you, I will let you close the discussion once you have created the new release?

Sincerely.

Hello Kevin,

After some more extensive testing and debugging on my side of the project, I can confirm that the solution you implemented works perfectly.
Once again, I thank you very much for your help on this issue.
If it is alright with you, I will let you close the discussion once you have created the new release?

Sincerely.

@kwquick

This comment has been minimized.

Show comment
Hide comment
@kwquick

kwquick Mar 2, 2017

Contributor

Thank you for the confirmation.

I delayed releasing this feature to address a regression in the 3.6.2 release. I believe I identified the source of the regression and have just released 3.6.3 with the offending commit reverted. I would like to wait 48 hours to see if the 3.6.3 improves the 3.6.2 situation, and if it does, I will make a 3.7.0 release containing the 3.6.3 fix as well as this new functionality.

Contributor

kwquick commented Mar 2, 2017

Thank you for the confirmation.

I delayed releasing this feature to address a regression in the 3.6.2 release. I believe I identified the source of the regression and have just released 3.6.3 with the offending commit reverted. I would like to wait 48 hours to see if the 3.6.3 improves the 3.6.2 situation, and if it does, I will make a 3.7.0 release containing the 3.6.3 fix as well as this new functionality.

@kwquick

This comment has been minimized.

Show comment
Hide comment
@kwquick

kwquick Mar 21, 2017

Contributor

Thank you for your patience on this issue. This feature release was delayed to address a bug regression in 3.6.2, to evaluate some other possible issues, and because of some personal travel. However, I'm pleased to announce that the 3.7.0 release (https://github.com/godaddy/Thespian/releases/tag/thespian-3.7.0) is now available and contains this update.

Contributor

kwquick commented Mar 21, 2017

Thank you for your patience on this issue. This feature release was delayed to address a bug regression in 3.6.2, to evaluate some other possible issues, and because of some personal travel. However, I'm pleased to announce that the 3.7.0 release (https://github.com/godaddy/Thespian/releases/tag/thespian-3.7.0) is now available and contains this update.

@kwquick kwquick closed this Mar 21, 2017

@davinellulinvega

This comment has been minimized.

Show comment
Hide comment
@davinellulinvega

davinellulinvega Mar 21, 2017

Hello,

Thank you very much for the notification. I will make sure to update the Thespian package on my machines.

Sincerely.

Hello,

Thank you very much for the notification. I will make sure to update the Thespian package on my machines.

Sincerely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment