
Various fixes and improvements to the web components #837

Merged
merged 10 commits into keylime:master on Jan 19, 2022

Conversation

THS-on
Member

@THS-on THS-on commented Jan 16, 2022

Initially the idea was to convert all the components to the Tornado framework with #835, but moving the current threaded code to async did cause more problems than it solved.

This PR takes the general improvements from #835 without changing the components to Tornado.

@THS-on THS-on changed the title Varous fixes and improvements to the web components Various fixes and improvements to the web components Jan 16, 2022
@THS-on
Member Author

THS-on commented Jan 17, 2022

@aplanas in this PR the web related functions are moved from config.py to web_util.py, so you don't have to move them for #831.

@aplanas
Contributor

aplanas commented Jan 17, 2022

@aplanas in this PR the web related functions are moved from config.py to web_util.py, so you don't have to move them for #831.

Oh very cool. I paused #831 for now, and I will rebase it on top of this one.

@aplanas
Contributor

aplanas commented Jan 17, 2022

@THS-on BTW, what are your thoughts on taking advantage of the work done during these kinds of changes and providing unit tests for them?

It is a bit more work, but those changes can also be split into smaller commits.

@THS-on
Member Author

THS-on commented Jan 17, 2022

@aplanas adding more unit tests is a good idea. Especially now that we are refactoring some parts of Keylime, it is nice to have some confidence that the behaviour did not change.

@aplanas
Contributor

aplanas commented Jan 17, 2022

@aplanas adding more unit tests is a good idea. Especially now that we are refactoring some parts of Keylime, it is nice to have some confidence that the behaviour did not change.

I am of the same opinion! I added tests in my last PR that changed the config, and I think that good candidates to test here are the moved echo_json_response, list_to_dict or get_restful_params, and maybe bits of revocation_listener (like extracting the inner function perform_actions and testing it independently) or init_mtls.
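
A minimal sketch of how such a test could be structured, assuming web_util exposes get_restful_params and that it maps REST path segments to a dictionary; the expected shape is an assumption for illustration, not the confirmed return value (testing echo_json_response would additionally need a small fake request handler, omitted here):

    import unittest

    from keylime import web_util


    class TestWebUtil(unittest.TestCase):
        def test_get_restful_params_parses_path(self):
            # Assumption: the parsed path is returned as a dict keyed by
            # resource names; adjust the assertions to the real return value.
            params = web_util.get_restful_params("/v2/agents/some-uuid")
            self.assertIsInstance(params, dict)
            self.assertIn("agents", params)


    if __name__ == "__main__":
        unittest.main()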

@mpeters
Member

mpeters commented Jan 17, 2022

I agree with @aplanas. This change looks good (better organization and less duplication) and will be approved after tests are added.

The code is shared between registrar and verifier.

Signed-off-by: Thore Sommer <mail@thson.de>
tornado_response was renamed to TornadoResponse.

The json and yaml methods were removed because they were unused.

Signed-off-by: Thore Sommer <mail@thson.de>
The response object might not have error as a member.

Signed-off-by: Thore Sommer <mail@thson.de>
This function is not required for the Keylime configuration, so it is
moved into the web_util module.

Signed-off-by: Thore Sommer <mail@thson.de>
Removed busy waiting for the threads and removed abstractions for starting
and stopping threads/servers.

Signed-off-by: Thore Sommer <mail@thson.de>
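
For illustration only (not the Keylime code itself), the kind of change this commit describes, replacing a polling loop with a plain join:

    import threading
    import time

    def serve():
        # Stand-in for a server's run loop.
        time.sleep(2)

    server_thread = threading.Thread(target=serve)
    server_thread.start()

    # Busy waiting (removed): poll the thread state in a loop.
    # while server_thread.is_alive():
    #     time.sleep(1)

    # Instead: block directly until the thread finishes.
    server_thread.join()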
@THS-on
Member Author

THS-on commented Jan 18, 2022

@mpeters @aplanas I've now added tests for echo_json_response and get_restful_params.

@kkaarreell
Contributor

kkaarreell commented Jan 18, 2022

Hello @THS-on , regarding the following test failure:

:: [ 19:42:36 ] :: [  BEGIN   ] :: Running 'limeStopAgent'
Process wasn't terminated, sending SIGKILL signal...
:: [ 19:42:39 ] :: [   FAIL   ] :: Command 'limeStopAgent' (Expected 0, got 9)
/usr/share/beakerlib/journal.sh: line 1: 16580 Killed                  keylime_${NAME} >> ${LOGFILE} 2>&1

This looks similar to your former PR. In this case the agent didn't stop within 3 seconds after receiving SIGTERM, so it was killed with SIGKILL.

In your former PR there was a traceback after SIGINT (not SIGTERM), but it was a slightly different situation. For a few days I was using SIGINT in the tests to stop the process (since SIGINT has to be used to terminate a process when it is executed via the coverage script), but later I changed it to use SIGINT only for the coverage script and SIGTERM directly otherwise.

I have just checked how your new code behaves when SIGINT is sent, and the output was:

                out: Sending SIGINT
                out: :: [ 16:23:25 ] :: [  BEGIN   ] :: Running 'limeStopAgent'
                out: 2022-01-18 16:23:24.591 - keylime.local_action_rm_ssh - INFO - my serial: 2, cert serial: 2
                out: 2022-01-18 16:23:25.861 - keylime.cloudagent - INFO - Stopping revocation listener...
                out: 2022-01-18 16:23:25.863 - keylime.cloudagent - INFO - TERM Signal received, shutting down...
                out: 2022-01-18 16:23:28.892 - keylime.cloudagent - INFO - TERM Signal received, shutting down...
                out: 2022-01-18 16:23:28.920 - keylime.tpm - WARNING - Could not read YAML output of tpm2_getcap.
                out: Exception ignored in: <module 'threading' from '/usr/lib64/python3.10/threading.py'>
                out: Traceback (most recent call last):
                out:   File "/usr/lib64/python3.10/threading.py", line 1551, in _shutdown
                out:     lock.acquire()
                out:   File "/usr/local/lib/python3.10/site-packages/keylime-0.0.0-py3.10.egg/keylime/keylime_agent.py", line 710, in shutdown_handler
                out:     sys.exit(0)
                out: SystemExit: 0
                out: :: [ 16:23:31 ] :: [   PASS   ] :: Command 'limeStopAgent' (Expected 0, got 0)

Not sure if that helps.

EDIT: Btw, for the SIGTERM case the last message in agent.log was

keylime.cloudagent - INFO - TERM Signal received, shutting down...

EDIT2: Looking at the timestamps above, maybe 3 seconds are simply not enough. I will check it.

EDIT3: Even waiting 15 seconds didn't make a difference. Related tests PR RedHat-SP-Security/keylime-tests#29

@THS-on THS-on force-pushed the tornado-fixes branch 4 times, most recently from 7fe6898 to 4402ef9 on January 19, 2022 07:59
@THS-on
Member Author

THS-on commented Jan 19, 2022

@kkaarreell thank you for the detailed information. Can I also get this output log from the CI, or do I have to run the tests locally for that?

It seems that there is still some thread running. I join all the threads/processes that I started, so I don't know which thread might be causing this.

Do you have a simple way to reproduce it outside the tests? If I just start the agent and stop it with SIGINT, it works on my machine.

@kkaarreell
Contributor

Hello @THS-on , unfortunately agent.log is not available through the CI; the Packit interface doesn't allow uploading attachments. Also, I believe that the agent "freeze" has something to do with the tested scenario (maybe something about revocation?), since the 2nd test is not failing this way and the agent stops properly. I will do more tests today in order to identify the relevant test scenario that makes such a difference.

Btw, would it help to print e.g. the last 20 lines from the service log when the service stop fails?

@THS-on
Member Author

THS-on commented Jan 19, 2022

I will do more tests today in order to identify the relevant test scenario that makes such a difference.

Awesome, thank you.

Btw, would it help to print e.g. the last 20 lines from the service log when the service stop fails?

That might be helpful to at least get an idea where in the process it failed.

@kkaarreell
Contributor

Hi @THS-on , according to my testing it is the autorun.sh script in the payload that makes the difference.
In my test it looks like this:
https://github.com/RedHat-SP-Security/keylime-tests/blob/main/functional/basic-attestation-on-localhost/payload/autorun.sh
I have confirmed that the agent stops properly when this file is excluded from the payload.

@THS-on
Member Author

THS-on commented Jan 19, 2022

@kkaarreell thank you for finding this.

I found out that the payload thread was not terminating correctly. This is now fixed by marking the thread as a daemon and fixing the infinite loop.
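
For context, a minimal sketch (independent of the actual Keylime code) of why a non-daemon thread keeps the interpreter alive at shutdown and what daemon=True changes:

    import threading
    import time

    def run_payload_script():
        # Stand-in for executing the payload's autorun.sh; imagine it never returns.
        while True:
            time.sleep(1)

    # Without daemon=True, this thread would keep the interpreter alive on
    # shutdown, which is what made the agent hang after SIGTERM. With
    # daemon=True the process can exit even though the thread is still running.
    payload_thread = threading.Thread(target=run_payload_script, daemon=True)
    payload_thread.start()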

The revocation notifier is now started in a separate process and shutdown
is now handled by a signal handler.

Signed-off-by: Thore Sommer <mail@thson.de>
A timeout might cause an HTTP 500 response.

Signed-off-by: Thore Sommer <mail@thson.de>
Signed-off-by: Thore Sommer <mail@thson.de>
Signed-off-by: Thore Sommer <mail@thson.de>
The payload thread is now marked as a daemon, so that it is terminated
correctly on shutdown.

Fixed an issue where the logging of the process caused an infinite loop.

Signed-off-by: Thore Sommer <mail@thson.de>
@@ -24,6 +24,8 @@
import keylime_logging
import common
import sys

import keylime.web_util
Contributor

nit: it seems that the idiom used in Keylime is from keylime import web_util, and then web_util.echo_json_response() in the code.

I do not mind if it stays as it is now.

Member Author

Because I don't know if this demo code actually works, this was only changed by the automatic refactoring.

@@ -723,7 +724,7 @@ async def invoke_get_quote(agent, need_pubkey):

if response.status_code != 200:
# this is a connection error, retry get quote
if response.status_code == 599:
if response.status_code in [500, 599]:
Contributor

I see that process_agent will take care of limiting the number of retries, but 500 seems like a serious issue in the server. Does it make sense to retry on an internal error? Do you think that it will be a transient issue?

Member Author

The issue is that socket timeouts are treated as 500 status codes. If we find another way to distinguish them from other 500 issues, I'm happy to change this.

Contributor

I have been seeing a growing number of transient 500 status codes which are exactly socket timeouts, at scale (with hundreds to thousands of agents). It is imperative that we retry in case we get the aforementioned status code.
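
For reference, a hedged sketch of the retry behaviour being discussed, with a bounded number of attempts so a persistent 500 does not retry forever (the constants and the do_request placeholder are illustrative, not the Keylime implementation):

    import asyncio

    RETRY_STATUS_CODES = (500, 599)  # 599 is commonly used by Tornado for connection/timeout errors
    MAX_RETRIES = 5
    RETRY_INTERVAL = 2  # seconds

    async def get_quote_with_retries(do_request):
        # do_request is a placeholder coroutine returning an object with status_code.
        for attempt in range(MAX_RETRIES):
            response = await do_request()
            if response.status_code not in RETRY_STATUS_CODES:
                return response
            await asyncio.sleep(RETRY_INTERVAL)
        return response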

@@ -338,7 +337,7 @@ def initthread():
logger.info("No payload script %s found in %s/unzipped", initscript, secdir)
else:
logger.info("Executing payload script: %s/unzipped/%s", secdir, initscript)
payload_thread = threading.Thread(target=initthread)
payload_thread = threading.Thread(target=initthread, daemon=True)
Contributor

I am not sure here. Maybe this is needed and correct, but I feel that making it possible for the Python code to end and exit before the thread is done can cause problems.

For example, I am not sure what will happen if the agent runs the payload script in a thread, executed inside {secdir}/zypper, and we end the agent, which will maybe try to umount {secdir}. This can make the umount fail and leave {secdir} present in the system, or cause problems in the payload script later.

Member Author

Yeah, you are right. This fix is not totally optimal, but do we want the agent to hang if the payload script does not terminate?

Contributor

but do we want the agent to hang if the payload script does not terminate?

Right ...

Maybe this future umount should be using the --lazy parameter.
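
A small sketch of what that could look like, assuming the unmount is done by shelling out to umount; --lazy detaches the mount immediately and completes the cleanup once it is no longer busy (e.g. once the payload script finishes):

    import subprocess

    def umount_secdir_lazy(secdir: str) -> None:
        # Detach now, finish the unmount once nothing uses the mount anymore.
        subprocess.run(["umount", "--lazy", secdir], check=True)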

signal.signal(signal.SIGINT, shutdown_handler)

# Keep the main thread alive by waiting for the server thread
serverthread.join()
Contributor

Do we need to do revocation_process.join()?

Maybe we can register the payload script thread too, and join it here as well?

Member Author

Not really, because this will already keep the main thread active until we call the shutdown handler.
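
For illustration, a sketch of the pattern under discussion (names are placeholders, not the exact Keylime code): the main thread blocks in serverthread.join(), and the signal handler is where the revocation process gets stopped and joined.

    import multiprocessing
    import signal
    import sys
    import threading
    import time

    stop_event = threading.Event()

    def serve_forever():
        # Placeholder for the agent's HTTP server loop.
        stop_event.wait()

    def revocation_listener():
        # Placeholder for the revocation notifier loop.
        while True:
            time.sleep(1)

    revocation_process = multiprocessing.Process(target=revocation_listener)
    serverthread = threading.Thread(target=serve_forever)

    def shutdown_handler(signum, frame):
        # The revocation process is stopped here, so the main thread does not
        # need to join it explicitly after serverthread.join().
        revocation_process.terminate()
        revocation_process.join()
        stop_event.set()
        sys.exit(0)

    signal.signal(signal.SIGTERM, shutdown_handler)
    signal.signal(signal.SIGINT, shutdown_handler)

    revocation_process.start()
    serverthread.start()
    # Keep the main thread alive by waiting for the server thread.
    serverthread.join()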

@THS-on
Member Author

THS-on commented Jan 19, 2022

@mpeters I think this is now ready to merge.

@mpeters mpeters merged commit f3fb006 into keylime:master Jan 19, 2022
@THS-on THS-on deleted the tornado-fixes branch January 20, 2022 11:33