
Various fixes and improvements to the web components #837

Merged
merged 10 commits into keylime:master on Jan 19, 2022

Conversation

THS-on
Member

@THS-on THS-on commented Jan 16, 2022

Initially the idea was to convert all the components to the Tornado framework with #835, but moving the current threaded code to async did cause more problems than it solved.

This PR takes the general improvements from #835 without changing the components to Tornado.

@THS-on THS-on changed the title Varous fixes and improvements to the web components Various fixes and improvements to the web components Jan 16, 2022
@THS-on
Member Author

THS-on commented Jan 17, 2022

@aplanas in this PR the web related functions are moved from config.py to web_util.py, so you don't have to move them for #831.

@aplanas
Contributor

aplanas commented Jan 17, 2022

@aplanas in this PR the web related functions are moved from config.py to web_util.py, so you don't have to move them for #831.

Oh very cool. I paused #831 for now, and I will rebase it on top of this one.

@aplanas
Contributor

aplanas commented Jan 17, 2022

@THS-on BTW, what are your thoughts on taking advantage of the work done during these kinds of changes and providing unit tests for them?

It is a bit more work, but those changes can also be split into smaller commits.

@THS-on
Member Author

THS-on commented Jan 17, 2022

@aplanas adding more unit tests is a good idea. Especially now that we are refactoring some parts of Keylime, it is nice to have some confidence that the behaviour did not change.

@aplanas
Contributor

aplanas commented Jan 17, 2022

@aplanas adding more unit tests is a good idea. Especially now that we are refactoring some parts of Keylime, it is nice to have some confidence that the behaviour did not change.

I am of the same opinion! I added tests in my last PR that changed the config, and I think that good candidates to test here are the moved echo_json_response, list_to_dict or get_restful_params, and maybe bits of revocation_listener (like extracting the inner function perform_actions and testing it independently) or init_mtls.
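
A minimal sketch of how such a test could be structured, assuming web_util exposes get_restful_params and that it maps REST path segments to a dictionary; the expected shape is an assumption for illustration, not the confirmed return value (testing echo_json_response would additionally need a small fake request handler, omitted here):

    import unittest

    from keylime import web_util


    class TestWebUtil(unittest.TestCase):
        def test_get_restful_params_parses_path(self):
            # Assumption: the parsed path is returned as a dict keyed by
            # resource names; adjust the assertions to the real return value.
            params = web_util.get_restful_params("/v2/agents/some-uuid")
            self.assertIsInstance(params, dict)
            self.assertIn("agents", params)


    if __name__ == "__main__":
        unittest.main()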

@mpeters
Member

mpeters commented Jan 17, 2022

I agree with @aplanas. This change looks good (better organization and less duplication) and will be approved after tests are added.

The code is shared between registrar and verifier.

Signed-off-by: Thore Sommer <mail@thson.de>
tornado_response was renamed to TornadoResponse.

The json and yaml methods were removed because they were unused.

Signed-off-by: Thore Sommer <mail@thson.de>
The response object might not have error as a member.

Signed-off-by: Thore Sommer <mail@thson.de>
This function is not required for the Keylime configuration, so it is
moved into the web_util module.

Signed-off-by: Thore Sommer <mail@thson.de>
Removed busy waiting for the threads and removed abstractions for starting
and stopping threads/servers.

Signed-off-by: Thore Sommer <mail@thson.de>
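
For illustration only (not the Keylime code itself), the kind of change this commit describes, replacing a polling loop with a plain join:

    import threading
    import time

    def serve():
        # Stand-in for a server's run loop.
        time.sleep(2)

    server_thread = threading.Thread(target=serve)
    server_thread.start()

    # Busy waiting (removed): poll the thread state in a loop.
    # while server_thread.is_alive():
    #     time.sleep(1)

    # Instead: block directly until the thread finishes.
    server_thread.join()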
@THS-on
Member Author

THS-on commented Jan 18, 2022

@mpeters @aplanas I've now added tests for echo_json_response and get_restful_params.

@kkaarreell
Contributor

kkaarreell commented Jan 18, 2022

Hello @THS-on , regarding the following test failure:

:: [ 19:42:36 ] :: [  BEGIN   ] :: Running 'limeStopAgent'
Process wasn't terminated, sending SIGKILL signal...
:: [ 19:42:39 ] :: [   FAIL   ] :: Command 'limeStopAgent' (Expected 0, got 9)
/usr/share/beakerlib/journal.sh: line 1: 16580 Killed                  keylime_${NAME} >> ${LOGFILE} 2>&1

This looks similar to your former PR. In this case the agent didn't stop within 3 seconds after receiving SIGTERM, so it was killed with SIGKILL.

In your former PR there was a traceback after SIGINT (not SIGTERM), but it was a slightly different situation. For a few days I was using SIGINT in the tests to stop the process (since SIGINT has to be used to terminate a process when it is executed via the coverage script), but later I changed it to use SIGINT only for the coverage script and SIGTERM directly otherwise.

I have just checked how your new code behaves when SIGINT is sent, and the output was:

                out: Sending SIGINT
                out: :: [ 16:23:25 ] :: [  BEGIN   ] :: Running 'limeStopAgent'
                out: 2022-01-18 16:23:24.591 - keylime.local_action_rm_ssh - INFO - my serial: 2, cert serial: 2
                out: 2022-01-18 16:23:25.861 - keylime.cloudagent - INFO - Stopping revocation listener...
                out: 2022-01-18 16:23:25.863 - keylime.cloudagent - INFO - TERM Signal received, shutting down...
                out: 2022-01-18 16:23:28.892 - keylime.cloudagent - INFO - TERM Signal received, shutting down...
                out: 2022-01-18 16:23:28.920 - keylime.tpm - WARNING - Could not read YAML output of tpm2_getcap.
                out: Exception ignored in: <module 'threading' from '/usr/lib64/python3.10/threading.py'>
                out: Traceback (most recent call last):
                out:   File "/usr/lib64/python3.10/threading.py", line 1551, in _shutdown
                out:     lock.acquire()
                out:   File "/usr/local/lib/python3.10/site-packages/keylime-0.0.0-py3.10.egg/keylime/keylime_agent.py", line 710, in shutdown_handler
                out:     sys.exit(0)
                out: SystemExit: 0
                out: :: [ 16:23:31 ] :: [   PASS   ] :: Command 'limeStopAgent' (Expected 0, got 0)

Not sure if that helps.

EDIT: Btw, for the SIGTERM case the last message in agent.log was

keylime.cloudagent - INFO - TERM Signal received, shutting down...

EDIT2: Looking at the timestamps above, maybe 3 seconds are simply not enough. I will check it.

EDIT3: Even waiting 15 seconds didn't make a difference. Related tests PR RedHat-SP-Security/keylime-tests#29

@THS-on THS-on force-pushed the tornado-fixes branch 4 times, most recently from 7fe6898 to 4402ef9 on January 19, 2022 07:59
@THS-on
Member Author

THS-on commented Jan 19, 2022

@kkaarreell thank you for the detailed information. Can I also get this output log from the CI, or do I have to run the tests locally for that?

It seems that there is still some thread running. I join all the threads/processes that I started, so I don't know which thread might be causing this.

Do you have a simple way to reproduce it outside the tests? If I just start the agent and stop it with SIGINT, it works on my machine.

@kkaarreell
Contributor

Hello @THS-on , unfortunately agent.log is not available through the CI; the Packit interface doesn't allow uploading attachments. Also, I believe that the agent "freeze" has something to do with the tested scenario (maybe something about revocation?), since the 2nd test is not failing this way and the agent stops properly. I will do more tests today in order to identify the relevant test scenario that makes such a difference.

Btw, would it help to print e.g. the last 20 lines from the service log when the service stop fails?

@THS-on
Member Author

THS-on commented Jan 19, 2022

I will do more tests today in order to identify the relevant test scenario that makes such a difference.

Awesome, thank you.

Btw, would it help to print e.g. the last 20 lines from the service log when the service stop fails?

That might be helpful to at least get an idea where in the process it failed.

@kkaarreell
Contributor

Hi @THS-on , according to my testing it is the autorun.sh script in the payload that makes the difference.
In my test it looks like this:
https://github.com/RedHat-SP-Security/keylime-tests/blob/main/functional/basic-attestation-on-localhost/payload/autorun.sh
I have confirmed that the agent stops properly when this file is excluded from the payload.

@THS-on
Member Author

THS-on commented Jan 19, 2022

@kkaarreell thank you for finding this.

I found out that the payload thread was not terminating correctly. This is now fixed by marking the thread as a daemon and fixing the infinite loop.
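
For context, a minimal sketch (independent of the actual Keylime code) of why a non-daemon thread keeps the interpreter alive at shutdown and what daemon=True changes:

    import threading
    import time

    def run_payload_script():
        # Stand-in for executing the payload's autorun.sh; imagine it never returns.
        while True:
            time.sleep(1)

    # Without daemon=True, this thread would keep the interpreter alive on
    # shutdown, which is what made the agent hang after SIGTERM. With
    # daemon=True the process can exit even though the thread is still running.
    payload_thread = threading.Thread(target=run_payload_script, daemon=True)
    payload_thread.start()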

The revocation notifier is now started in a separate process and shutdown
is now handled by a signal handler.

Signed-off-by: Thore Sommer <mail@thson.de>
A timeout might cause an HTTP 500 response.

Signed-off-by: Thore Sommer <mail@thson.de>
Signed-off-by: Thore Sommer <mail@thson.de>
Signed-off-by: Thore Sommer <mail@thson.de>
The payload thread is now marked as a daemon, so that it is terminated
correctly on shutdown.

Fixed an issue where the logging of the process caused an infinite loop.

Signed-off-by: Thore Sommer <mail@thson.de>
@@ -24,6 +24,8 @@
import keylime_logging
import common
import sys

import keylime.web_util
Contributor

nit: it seems that the idiom used in Keylime is from keylime import web_util, and then web_util.echo_json_response() in the code.

I do not mind if it stays as it is now.

Member Author

Because I don't know if this demo code actually works, this was only changed by the automatic refactoring.

@@ -723,7 +724,7 @@ async def invoke_get_quote(agent, need_pubkey):

if response.status_code != 200:
# this is a connection error, retry get quote
if response.status_code == 599:
if response.status_code in [500, 599]:
Contributor

I see that process_agent will take care of limiting the number of retries, but 500 seems like a serious issue in the server. Does it make sense to retry on an internal error? Do you think that it will be a transient issue?

Member Author

The issue is that socket timeouts are treated as 500 status codes. If we find another way to distinguish them from other 500 issues, I'm happy to change this.

Contributor

I have been seeing a growing number of transient 500 status codes which are exactly socket timeouts, at scale (with hundreds to thousands of agents). It is imperative that we retry in case we get the aforementioned status code.
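
For reference, a hedged sketch of the retry behaviour being discussed, with a bounded number of attempts so a persistent 500 does not retry forever (the constants and the do_request placeholder are illustrative, not the Keylime implementation):

    import asyncio

    RETRY_STATUS_CODES = (500, 599)  # 599 is commonly used by Tornado for connection/timeout errors
    MAX_RETRIES = 5
    RETRY_INTERVAL = 2  # seconds

    async def get_quote_with_retries(do_request):
        # do_request is a placeholder coroutine returning an object with status_code.
        for attempt in range(MAX_RETRIES):
            response = await do_request()
            if response.status_code not in RETRY_STATUS_CODES:
                return response
            await asyncio.sleep(RETRY_INTERVAL)
        return response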

@@ -338,7 +337,7 @@ def initthread():
logger.info("No payload script %s found in %s/unzipped", initscript, secdir)
else:
logger.info("Executing payload script: %s/unzipped/%s", secdir, initscript)
payload_thread = threading.Thread(target=initthread)
payload_thread = threading.Thread(target=initthread, daemon=True)
Contributor

I am not sure here. Maybe this is needed and correct, but I feel that making it possible for the Python code to end and exit before the thread is done can cause problems.

For example, I am not sure what will happen if the agent runs the payload script in a thread, executed inside {secdir}/zypper, and we end the agent, which will maybe try to umount {secdir}. This can make the umount fail and leave {secdir} present in the system, or cause problems in the payload script later.

Member Author

Yeah, you are right. This fix is not totally optimal, but do we want the agent to hang if the payload script does not terminate?

Contributor

but do we want the agent to hang if the payload script does not terminate?

Right ...

Maybe this future umount should be using the --lazy parameter.
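
A small sketch of what that could look like, assuming the unmount is done by shelling out to umount; --lazy detaches the mount immediately and completes the cleanup once it is no longer busy (e.g. once the payload script finishes):

    import subprocess

    def umount_secdir_lazy(secdir: str) -> None:
        # Detach now, finish the unmount once nothing uses the mount anymore.
        subprocess.run(["umount", "--lazy", secdir], check=True)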

signal.signal(signal.SIGINT, shutdown_handler)

# Keep the main thread alive by waiting for the server thread
serverthread.join()
Contributor

Do we need to do revocation_process.join()?

Maybe we can register the payload script thread too, and join it here as well?

Member Author

Not really, because this will already keep the main thread active until we call the shutdown handler.
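
For illustration, a sketch of the pattern under discussion (names are placeholders, not the exact Keylime code): the main thread blocks in serverthread.join(), and the signal handler is where the revocation process gets stopped and joined.

    import multiprocessing
    import signal
    import sys
    import threading
    import time

    stop_event = threading.Event()

    def serve_forever():
        # Placeholder for the agent's HTTP server loop.
        stop_event.wait()

    def revocation_listener():
        # Placeholder for the revocation notifier loop.
        while True:
            time.sleep(1)

    revocation_process = multiprocessing.Process(target=revocation_listener)
    serverthread = threading.Thread(target=serve_forever)

    def shutdown_handler(signum, frame):
        # The revocation process is stopped here, so the main thread does not
        # need to join it explicitly after serverthread.join().
        revocation_process.terminate()
        revocation_process.join()
        stop_event.set()
        sys.exit(0)

    signal.signal(signal.SIGTERM, shutdown_handler)
    signal.signal(signal.SIGINT, shutdown_handler)

    revocation_process.start()
    serverthread.start()
    # Keep the main thread alive by waiting for the server thread.
    serverthread.join()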

@THS-on
Member Author

THS-on commented Jan 19, 2022

@mpeters I think this is now ready to merge.

@mpeters mpeters merged commit f3fb006 into keylime:master Jan 19, 2022
@THS-on THS-on deleted the tornado-fixes branch January 20, 2022 11:33