Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PAL/Linux-SGX] Re-init attestation key after AESM restarted #1881

Merged
merged 1 commit into from
Jun 6, 2024

Conversation

dimakuv
Copy link
Contributor

@dimakuv dimakuv commented May 13, 2024

Description of the changes

Gramine uses the AESM daemon to request SGX Quotes from the Quoting Enclave. AESM runs in the background and may be restarted. For some reason, after a restart AESM "forgets" the platform's attestation key and requires an additional INIT_QUOTE_REQUEST before it is able to generate SGX Quotes again. Moreover, the Quoting Enclave's Target Info may change at this point, so Gramine must update its cached QE Target Info. This behavior is applicable only to DCAP attestation and is documented in Intel SGX PSW/DCAP software. Gramine didn't implement this behavior and thus failed with "AESM service returned error 42".

This PR complies with the SGX PSW/DCAP documentation and adds the send of INIT_QUOTE_REQUEST + update of QE Target Info + retry of SGX quote retrieval.

This is a re-creation of #1877, with a better design. Closes #1877.

How to test this PR?

Change CI-Examples/python/scripts/sgx-quote.py to this:

import os
import sys
import time

while True:
    if not os.path.exists("/dev/attestation/quote"):
        print("Cannot find `/dev/attestation/quote`; "
                "are you running under SGX, with remote attestation enabled?")
        sys.exit(1)

    with open('/dev/attestation/attestation_type') as f:
        print(f"Detected attestation type: {f.read()}")

    with open("/dev/attestation/user_report_data", "wb") as f:
        f.write(b'\0'*64)

    with open("/dev/attestation/quote", "rb") as f:
        quote = f.read()

    print(f"Extracted SGX quote with size = {len(quote)} and the following fields:")
    print(f"  ATTRIBUTES.FLAGS: {quote[96:104].hex()}  [ Debug bit: {quote[96] & 2 > 0} ]")
    print(f"  ATTRIBUTES.XFRM:  {quote[104:112].hex()}")
    print(f"  MRENCLAVE:        {quote[112:144].hex()}")
    print(f"  MRSIGNER:         {quote[176:208].hex()}")
    print(f"  ISVPRODID:        {quote[304:306].hex()}")
    print(f"  ISVSVN:           {quote[306:308].hex()}")
    print(f"  REPORTDATA:       {quote[368:400].hex()}")
    print(f"                    {quote[400:432].hex()}")

    sys.stdout.flush()
    time.sleep(5)

Apply the above Python script, run it in one terminal, and in another terminal do sudo systemctl restart aesmd. Without this PR, Gramine will fail. With this PR, Gramine will continue execution of the endless loop.


This change is Reviewable

Copy link
Contributor

@llly llly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 6 files at r1, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)


-- commits line 2 at r1:

Suggestion:

Re-init attestation key after an AESM restart

-- commits line 15 at r1:
ditto later in this sentence

Suggestion:

sending

pal/src/host/linux-sgx/enclave_ocalls.h line 136 at r1 (raw file):

 *
 * \param      is_epid            True if EPID is used, false if DCAP/ECDSA is used.
 * \param[out] qe_targetinfo      Retrieved Quoting Enclave's target info; buffer must be submitted

I'd drop this note altogether, isn't it obvious? It's a single-level pointer, so there's no way for the function to allocate a buffer by itself and return it.

Suggestion:

buffer must be provided

pal/src/host/linux-sgx/enclave_ocalls.h line 141 at r1 (raw file):

 * \returns 0 on success, negative Linux error code otherwise.
 *
 * The obtained QE Target Info is not validated in any way (i.e., this function does not check

Suggestion:

this function does not check whether

pal/src/host/linux-sgx/enclave_ocalls.h line 144 at r1 (raw file):

 * Target Info contents make sense).
 */
int ocall_get_qe_targetinfo(bool is_epid, sgx_target_info_t* qe_targetinfo);

Double checking if I understand the security model here correctly: qe_targetinfo is completely untrusted, and thus we can't assume that we're talking directly to a real QE (we can be talking to a fake one, or be MitM-ed).
This is not a problem, because only the authentic QE has access to the keys to generate a valid quote (so, the fake QE can't), and our report targets whatever we got in qe_targetinfo and the MitMer can't re-encrypt it and then send it to a real QE (due to how EREPORT works).
Is this right?


pal/src/host/linux-sgx/enclave_ocalls.c line 2199 at r1 (raw file):

    if (!ocall_qe_ti_args) {
        sgx_reset_ustack(old_ustack);
        return -ENOMEM;

Almost all other places return -EPERM in this case.


pal/src/host/linux-sgx/enclave_platform.c line 8 at r1 (raw file):

/* this function assumes proper synchronization by callers, such that
 * g_pal_linuxsgx_state.qe_targetinfo is not racey */

This should be documented in the header, not here. Also, it's not only about calling this function multiple times concurrently, but also about anything else which could be reading g_pal_linuxsgx_state.qe_targetinfo at the same time.


pal/src/host/linux-sgx/enclave_platform.c line 8 at r1 (raw file):

/* this function assumes proper synchronization by callers, such that
 * g_pal_linuxsgx_state.qe_targetinfo is not racey */

Do we actually have that synchronization in place? Also, it's quite a surprising/risky interface. Why not just use a lock?

Copy link
Contributor Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 7 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @llly, and @mkow)


-- commits line 2 at r1:
Will do during final rebase


-- commits line 15 at r1:

Previously, mkow (Michał Kowalczyk) wrote…

ditto later in this sentence

Will do during final rebase


pal/src/host/linux-sgx/enclave_ocalls.h line 136 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

I'd drop this note altogether, isn't it obvious? It's a single-level pointer, so there's no way for the function to allocate a buffer by itself and return it.

Done.


pal/src/host/linux-sgx/enclave_ocalls.h line 141 at r1 (raw file):

 * \returns 0 on success, negative Linux error code otherwise.
 *
 * The obtained QE Target Info is not validated in any way (i.e., this function does not check

Done.


pal/src/host/linux-sgx/enclave_ocalls.h line 144 at r1 (raw file):
This is correct. Though one of your phrases is vague:

the MitMer can't re-encrypt it and then send it to a real QE

What do you mean by re-encrypt? The MitMer can't replace the measurements in the intercepted SGX Report (like MRENCLAVE), because then the MAC of the SGX Report won't correspond to the overridden values. And the MitMer cannot generate the new MAC because she doesn't have the corresponding SGX key (with which the SGX Report was signed). And if the MitMer would call EREPORT instruction, then she won't be able to use victim-enclave measurements, because EREPORT + EINIT disallow this.


pal/src/host/linux-sgx/enclave_ocalls.c line 2199 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Almost all other places return -EPERM in this case.

Done, though this doesn't sound right. But here it's not important which error code we return, so whatever.


pal/src/host/linux-sgx/enclave_platform.c line 8 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Do we actually have that synchronization in place? Also, it's quite a surprising/risky interface. Why not just use a lock?

We always had this assumption:

* This pseudo-FS interface is not thread-safe. It is the responsibility of the application to
* correctly synchronize concurrent accesses to the pseudo-files. We expect attestation flows to
* be generally single-threaded and therefore do not introduce synchronization here.

Even if we would need a lock, it shouldn't be at this level, but rather inside LibOS.


pal/src/host/linux-sgx/enclave_platform.c line 8 at r1 (raw file):
Done.

it's not only about calling this function multiple times concurrently, but also about anything else which could be reading g_pal_linuxsgx_state.qe_targetinfo at the same time

Yes, but the only function that operates on this global var's field is this one (sgx_get_quote()). Do you want me to add something additionally?

@woju
Copy link
Member

woju commented Jun 4, 2024

Jenkins, test this please

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r2, all commit messages.
Reviewable status: 6 of 7 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @llly)


pal/src/host/linux-sgx/enclave_platform.c line 8 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Done.

it's not only about calling this function multiple times concurrently, but also about anything else which could be reading g_pal_linuxsgx_state.qe_targetinfo at the same time

Yes, but the only function that operates on this global var's field is this one (sgx_get_quote()). Do you want me to add something additionally?

Yes. Right now that line is:

sgx_target_info_t qe_targetinfo; /* received from untrusted host, use carefully */

This comment sounds like this field was received from the host and was constant afterwards. I can easily see how we could add some functionality using this and not notice that it can change.

Actually, maybe it would be cleaner if we moved that field from struct pal_linuxsgx_state to a private variable in enclave_platform.c and initialize in on ECALL_ENCLAVE_START? (via calling something like initialize_quoting() which would copy the enclave start argument into that private, global variable)
It should be obvious then that this variable is only for the use of sgx_quote() and would be much harder to misuse.
This way you could also lock accesses to it properly and remove that tricky assumption from sgx_get_quote().


pal/src/host/linux-sgx/enclave_platform.c line 8 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

We always had this assumption:

* This pseudo-FS interface is not thread-safe. It is the responsibility of the application to
* correctly synchronize concurrent accesses to the pseudo-files. We expect attestation flows to
* be generally single-threaded and therefore do not introduce synchronization here.

Even if we would need a lock, it shouldn't be at this level, but rather inside LibOS.

See above. I think it's obvious from the comment you quoted that e.g. setting /dev/attestation/user_report_data and reading /dev/attestation/quote cannot be concurrent, but it would still be surprising to me that I cannot read /dev/attestation/quote concurrently (because it's just reading).

Copy link
Contributor Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 9 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @llly and @mkow)


pal/src/host/linux-sgx/enclave_platform.c line 8 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Yes. Right now that line is:

sgx_target_info_t qe_targetinfo; /* received from untrusted host, use carefully */

This comment sounds like this field was received from the host and was constant afterwards. I can easily see how we could add some functionality using this and not notice that it can change.

Actually, maybe it would be cleaner if we moved that field from struct pal_linuxsgx_state to a private variable in enclave_platform.c and initialize in on ECALL_ENCLAVE_START? (via calling something like initialize_quoting() which would copy the enclave start argument into that private, global variable)
It should be obvious then that this variable is only for the use of sgx_quote() and would be much harder to misuse.
This way you could also lock accesses to it properly and remove that tricky assumption from sgx_get_quote().

Done.


pal/src/host/linux-sgx/enclave_platform.c line 8 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

See above. I think it's obvious from the comment you quoted that e.g. setting /dev/attestation/user_report_data and reading /dev/attestation/quote cannot be concurrent, but it would still be surprising to me that I cannot read /dev/attestation/quote concurrently (because it's just reading).

Done.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)


pal/src/host/linux-sgx/enclave_platform.c line 14 at r3 (raw file):

int init_qe_targetinfo(void* uptr_qe_targetinfo) {
    /* no need to acquire g_qe_targetinfo_lock at this point since there is a single thread */

Suggestion:

since there is only a single thread */

pal/src/host/linux-sgx/enclave_platform.c line 34 at r3 (raw file):

        if (retries) {
            /* new attempt, need to update QE target info */
            ret = ocall_get_qe_targetinfo(/*is_epid=*/!!spid, &g_qe_targetinfo);

nitpick: you could OCALL into a local variable, and then only memcpy under the spinlock
...which you actually do anyways, below, so just swap these two :)

Copy link
Contributor Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 9 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)


pal/src/host/linux-sgx/enclave_platform.c line 14 at r3 (raw file):

int init_qe_targetinfo(void* uptr_qe_targetinfo) {
    /* no need to acquire g_qe_targetinfo_lock at this point since there is a single thread */

Done.


pal/src/host/linux-sgx/enclave_platform.c line 34 at r3 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

nitpick: you could OCALL into a local variable, and then only memcpy under the spinlock
...which you actually do anyways, below, so just swap these two :)

Done.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

Copy link
Contributor

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 6 files at r1, 2 of 4 files at r2, 3 of 4 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

Gramine uses the AESM daemon to request SGX Quotes from the Quoting
Enclave. AESM runs in the background and may be restarted. For some
reason, after a restart AESM "forgets" the platform's attestation key
and requires an additional INIT_QUOTE_REQUEST before it is able to
generate SGX Quotes again. Moreover, the Quoting Enclave's Target Info
may change at this point, so Gramine must update its cached QE Target
Info. This behavior is applicable only to DCAP attestation and is
documented in Intel SGX PSW/DCAP software. Gramine didn't implement this
behavior and thus failed with "AESM service returned error 42".

This commit complies with the SGX PSW/DCAP documentation and adds the
sending of INIT_QUOTE_REQUEST + updating QE Target Info + retrying the
SGX quote retrieval.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
@dimakuv dimakuv force-pushed the dimakuv/fix-aesmd-restart-failure-v2 branch from d6e6bb9 to 8a804e5 Compare June 6, 2024 07:35
Copy link
Contributor Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @mkow)


-- commits line 2 at r1:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Will do during final rebase

Done.


-- commits line 15 at r1:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Will do during final rebase

Done.

Copy link
Contributor Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @mkow)

a discussion (no related file):
After final rebase, tested this PR one last time on a DCAP system -- everything works.


Copy link
Contributor

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @mkow)

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@mkow mkow merged commit 8a804e5 into master Jun 6, 2024
18 of 20 checks passed
@mkow mkow deleted the dimakuv/fix-aesmd-restart-failure-v2 branch June 6, 2024 12:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants