Various fixes and improvements to the web components #837
Conversation
@THS-on BTW, what are your thoughts on taking advantage of the work done during this kind of change and providing unit tests for it? It is a bit more work, but those changes can also be split into smaller commits.
@aplanas adding more unit tests is a good idea. Especially now that we are refactoring some parts of Keylime, it is nice to have some confidence that the behaviour did not change.
I am of the same opinion! I added the tests in my last PR that changed the
I agree with @aplanas. This change looks good (better organization and less duplication) and will be approved after tests are added.
The code is shared between registrar and verifier. Signed-off-by: Thore Sommer <mail@thson.de>
tornado_response was renamed to TornadoResponse. The json and yaml methods were removed because they were unused. Signed-off-by: Thore Sommer <mail@thson.de>
The response object might not have error as a member. Signed-off-by: Thore Sommer <mail@thson.de>
This function is not required for the Keylime configuration, so it is moved into the web_util module. Signed-off-by: Thore Sommer <mail@thson.de>
Removed busy waiting for the threads and removed abstractions for starting and stopping threads/servers. Signed-off-by: Thore Sommer <mail@thson.de>
Hello @THS-on, regarding the following test failure:
This looks similar to your former PR. In this case the agent didn't stop within 3 seconds after receiving SIGTERM, so it was killed with SIGKILL. In your former PR there was a traceback after SIGINT (not SIGTERM), but that was a slightly different situation. For a few days I was using SIGINT in the tests to stop the process (since SIGINT has to be used to terminate a process when it is executed via the coverage script), but later I changed it to use SIGINT only for the coverage script and SIGTERM directly otherwise. Just now I checked how your new code behaves when SIGINT is sent, and the output was:
Not sure if that helps. EDIT: Btw, for the SIGTERM case the last message in agent.log was
EDIT2: Looking at the timestamps above, maybe 3 seconds are simply not enough. I will check it. EDIT3: Even waiting 15 seconds didn't make a difference. Related tests PR: RedHat-SP-Security/keylime-tests#29
Force-pushed from 7fe6898 to 4402ef9
@kkaarreell thank you for the detailed information. Can I get this output log from the CI as well, or do I have to run the tests locally for that? It seems that there is still some thread running. I join all the threads/processes that I started, so I don't know which thread might cause this. Do you have a simple way to reproduce it outside the tests? If I just start the agent and stop it with SIGINT, it works on my machine.
Hello @THS-on, unfortunately agent.log is not available through CI; the Packit interface doesn't allow uploading attachments. Also, I believe that the agent "freeze" has something to do with the tested scenario (maybe something about revocation?), since the 2nd test is not failing this way and the agent stops properly. I will do more tests today in order to identify the relevant test scenario that makes such a difference. Btw, would it help to print e.g. the last 20 lines from the service log when the service stop fails?
Awesome, thank you.
That might be helpful to at least get an idea of where in the process it failed.
Hi @THS-on, according to my testing it is
@kkaarreell thank you for finding this. I found out that the payload thread was not terminating correctly. This is now fixed by marking this thread as a daemon and fixing the infinite loop.
The revocation notifier is now started in a separate process and shutdown is now handled by a signal handler. Signed-off-by: Thore Sommer <mail@thson.de>
A timeout might cause an HTTP 500 response. Signed-off-by: Thore Sommer <mail@thson.de>
Signed-off-by: Thore Sommer <mail@thson.de>
Signed-off-by: Thore Sommer <mail@thson.de>
The payload thread is now marked as a daemon, so that it is terminated correctly on shutdown. Fixed an issue where the logging of the process caused an infinite loop. Signed-off-by: Thore Sommer <mail@thson.de>
```diff
@@ -24,6 +24,8 @@
 import keylime_logging
 import common
 import sys
+
+import keylime.web_util
```
nit: it seems that the idiom used in Keylime is `from keylime import web_util`, and then `web_util.echo_json_response()` in the code. I do not mind if it stays as it is now.
Because I don't know if this demo code actually works, this was only changed by the automatic refactoring.
```diff
@@ -723,7 +724,7 @@ async def invoke_get_quote(agent, need_pubkey):

     if response.status_code != 200:
         # this is a connection error, retry get quote
-        if response.status_code == 599:
+        if response.status_code in [500, 599]:
```
I see that `process_agent` will take care of limiting the number of retries, but 500 seems like a serious issue in the server. Does it make sense to retry on an internal error? Do you think it will be a transient issue?
The issue is that socket timeouts are treated as 500 status codes. If we find another way to distinguish them from other 500 issues, I'm happy to change this.
I have been seeing a growing number of transient 500 status codes which are exactly socket timeouts, at scale (with hundreds to thousands of agents). It is imperative that we retry when we get the aforementioned status code.
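For context, a minimal sketch of the retry decision being discussed; the helper name and the retry counter are illustrative, not Keylime's actual code (Tornado's HTTP client reports connection and timeout errors as status 599):

```python
# Tornado reports connection/timeout errors as status 599; at scale,
# socket timeouts can also surface as transient 500 responses, which is
# why this PR treats both as retryable.
RETRYABLE_STATUS_CODES = {500, 599}

def should_retry(status_code: int, attempts: int, max_retries: int) -> bool:
    """Retry only transient failures, and only up to max_retries times."""
    return status_code in RETRYABLE_STATUS_CODES and attempts < max_retries
```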
```diff
@@ -338,7 +337,7 @@ def initthread():
         logger.info("No payload script %s found in %s/unzipped", initscript, secdir)
     else:
         logger.info("Executing payload script: %s/unzipped/%s", secdir, initscript)
-        payload_thread = threading.Thread(target=initthread)
+        payload_thread = threading.Thread(target=initthread, daemon=True)
```
I am not sure here. Maybe this is needed and correct, but I feel that making it possible for the Python code to end and exit before the thread is done can cause problems.
For example, I am not sure what will happen if the agent runs the payload script in a thread that is executed inside `{secdir}/zypper`, and we end the agent, which maybe will try to umount `{secdir}`. This can make the umount fail and leave `{secdir}` present in the system, or cause problems in the payload script later.
Yeah, you are right. This fix is not totally optimal, but do we want the agent to hang if the payload script does not terminate?
> but do we want the agent to hang if the payload script does not terminate?

Right ...
Maybe this future `umount` should be using the `--lazy` parameter.
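For illustration, a lazy unmount detaches the mount point immediately and defers the actual cleanup until nothing uses it anymore, which would avoid both hanging the agent and leaving `{secdir}` mounted. A minimal sketch (the `cleanup_secdir` helper is hypothetical, not code from this PR):

```python
import subprocess

def cleanup_secdir(secdir: str) -> None:
    # umount --lazy detaches the filesystem now and finishes the unmount
    # once nothing (e.g. a still-running payload script) uses it anymore.
    subprocess.run(["umount", "--lazy", secdir], check=False)
```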
```python
    signal.signal(signal.SIGINT, shutdown_handler)

    # Keep the main thread alive by waiting for the server thread
    serverthread.join()
```
Do we need to do `revocation_process.join()`?
Maybe we can register the payload script thread too, and join it here as well?
Not really, because this will already keep the main thread active until we call the shutdown handler.
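Putting the pieces of this thread together, a minimal sketch of the shutdown pattern being discussed. The `shutdown_handler` and `serverthread` names come from the diff; the `serve` placeholder and the `server_stopped` event are hypothetical stand-ins for the real server loop:

```python
import signal
import threading

server_stopped = threading.Event()

def serve():
    # Stand-in for the real server loop: block until shutdown is requested.
    server_stopped.wait()

serverthread = threading.Thread(target=serve)

def shutdown_handler(signum, frame):
    # Unblock the server thread; daemon threads (like the payload thread)
    # die with the process, and the revocation notifier process would be
    # terminated here as well.
    server_stopped.set()

signal.signal(signal.SIGINT, shutdown_handler)
signal.signal(signal.SIGTERM, shutdown_handler)

serverthread.start()
# Joining the server thread keeps the main thread alive (so it can
# receive signals) until the handler lets the server thread finish.
serverthread.join()
```

Because the payload thread is a daemon, nothing else needs to be joined here, which matches the answer above.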
@mpeters I think this is now ready to merge.
Initially the idea was to convert all the components to the Tornado framework with #835, but moving the current threaded code to async caused more problems than it solved.
This PR takes the general improvements from #835 without changing the components to Tornado.