Commit 6b5b71f

KVM: TDX: ADD pages to the TD image while populating mirror EPT entries
When populating the initial memory image for a TDX guest, ADD pages to the TD as part of establishing the mappings in the mirror EPT, as opposed to creating the mappings and then doing ADD after the fact. Doing ADD in the S-EPT callbacks eliminates the need to track "premapped" pages, as the mirror EPT (M-EPT) and S-EPT are always synchronized, e.g. if ADD fails, KVM reverts to the previous M-EPT entry (guaranteed to be !PRESENT). Eliminating the hole where the M-EPT can have a mapping that doesn't exist in the S-EPT in turn obviates the need to handle errors that are unique to encountering a missing S-EPT entry (see tdx_is_sept_zap_err_due_to_premap()).

Keeping the M-EPT and S-EPT synchronized also eliminates the need to check for unconsumed "premap" entries during tdx_td_finalize(), as there simply can't be any such entries. Dropping that check in particular reduces the overall cognitive load, as the management of nr_premapped with respect to removal of S-EPT is _very_ subtle. E.g. successful removal of an S-EPT entry after it completed ADD doesn't adjust nr_premapped, but it's not clear why that's "ok" but having half-baked entries is not (it's not truly "ok" in that removing pages from the image will likely prevent the guest from booting, but from KVM's perspective it's "ok").

Doing ADD in the S-EPT path requires passing an argument via a scratch field, but the current approach of tracking the number of "premapped" pages effectively does the same. And the "premapped" counter is much more dangerous, as it doesn't have a singular lock to protect its usage, since nr_premapped can be modified as soon as mmu_lock is dropped, at least in theory. I.e. nr_premapped is guarded by slots_lock, but only for "happy" paths.

Note, this approach was used/tried at various points in TDX development, but was ultimately discarded due to a desire to avoid stashing temporary state in kvm_tdx. But as above, KVM ended up with such state anyways, and fully committing to using temporary state provides better access rules (100% guarded by slots_lock), and makes several edge cases flat out impossible.

Note #2, continue to extend the measurement outside of mmu_lock, as it's a slow operation (typically 16 SEAMCALLs per page whose data is included in the measurement), and doesn't *need* to be done under mmu_lock, e.g. for consistency purposes. However, MR.EXTEND isn't _that_ slow, e.g. ~1ms latency to measure a full page, so if it needs to be done under mmu_lock in the future, e.g. because KVM gains a flow that can remove S-EPT entries during KVM_TDX_INIT_MEM_REGION, then extending the measurement can also be moved into the S-EPT mapping path (again, only if absolutely necessary).

P.S. _If_ MR.EXTEND is moved into the S-EPT path, take care not to return an error up the stack if TDH_MR_EXTEND fails, as removing the M-EPT entry but not the S-EPT entry would result in inconsistent state!

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
1 parent b4b2b6e commit 6b5b71f

2 files changed, +44 -72 lines changed
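For orientation, the reworked KVM_TDX_INIT_MEM_REGION flow can be summarized by the sketch below. It is a condensed outline, not the literal patch: the function signature is abbreviated (the vCPU is shown as a direct parameter), the KVM_BUG_ON() sanity checks and the MR.EXTEND measurement loop are omitted, and the use of get_user_pages_fast() to pin the source page is an assumption based on the surrounding context. The identifiers kvm_tdx->page_add_src, tdx_mem_page_add() and kvm_tdp_mmu_map_private_pfn() come from the hunks that follow.

static int tdx_gmem_post_populate(struct kvm *kvm, struct kvm_vcpu *vcpu,
				  gfn_t gfn, kvm_pfn_t pfn, void __user *src)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct page *src_page;
	int ret;

	/* Pin the userspace page holding the data to be ADDed to the TD. */
	if (get_user_pages_fast((unsigned long)src, 1, 0, &src_page) != 1)
		return -ENOMEM;

	/*
	 * Stash the source page so that tdx_mem_page_add(), invoked from the
	 * S-EPT callback while the mirror EPT mapping is established, can
	 * pass it to TDH.MEM.PAGE.ADD.  The scratch field is guarded by
	 * slots_lock, so no other populate/finalize flow can observe it.
	 */
	kvm_tdx->page_add_src = src_page;
	ret = kvm_tdp_mmu_map_private_pfn(vcpu, gfn, pfn);
	kvm_tdx->page_add_src = NULL;

	put_page(src_page);
	return ret;
}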

arch/x86/kvm/vmx/tdx.c

Lines changed: 38 additions & 70 deletions
@@ -1583,6 +1583,32 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
+static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+			    kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err, entry, level_state;
+	gpa_t gpa = gfn_to_gpa(gfn);
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
+	    KVM_BUG_ON(!kvm_tdx->page_add_src, kvm))
+		return -EIO;
+
+	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
+			       kvm_tdx->page_add_src, &entry, &level_state);
+	if (unlikely(tdx_operand_busy(err)))
+		return -EBUSY;
+
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
+		return -EIO;
+	}
+
+	return 0;
+}
+
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 			    enum pg_level level, kvm_pfn_t pfn)
 {
@@ -1628,19 +1654,10 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 
 	/*
 	 * If the TD isn't finalized/runnable, then userspace is initializing
-	 * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of
-	 * pages that need to be mapped and initialized via TDH.MEM.PAGE.ADD.
-	 * KVM_TDX_FINALIZE_VM checks the counter to ensure all pre-mapped
-	 * pages have been added to the image, to prevent running the TD with a
-	 * valid mapping in the mirror EPT, but not in the S-EPT.
-	 */
-	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
-		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
-			return -EIO;
-
-		atomic64_inc(&kvm_tdx->nr_premapped);
-		return 0;
-	}
+	 * the VM image via KVM_TDX_INIT_MEM_REGION; ADD the page to the TD.
+	 */
+	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
+		return tdx_mem_page_add(kvm, gfn, level, pfn);
 
 	return tdx_mem_page_aug(kvm, gfn, level, pfn);
 }
@@ -1666,39 +1683,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-/*
- * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
- * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
- * successfully.
- *
- * Since tdh_mem_sept_add() must have been invoked successfully before a
- * non-leaf entry present in the mirrored page table, the SEPT ZAP related
- * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
- * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
- * SEPT.
- *
- * Further check if the returned entry from SEPT walking is with RWX permissions
- * to filter out anything unexpected.
- *
- * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
- * level_state returned from a SEAMCALL error is the same as that passed into
- * the SEAMCALL.
- */
-static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
-					     u64 entry, int level)
-{
-	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
-		return false;
-
-	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
-		return false;
-
-	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
-		return false;
-
-	return true;
-}
-
 static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, struct page *page)
 {
@@ -1718,12 +1702,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
 		tdx_no_vcpus_enter_stop(kvm);
 	}
-	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
-		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
-			return -EIO;
-
-		return 0;
-	}
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
@@ -2839,12 +2817,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 
 	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
 		return -EINVAL;
-	/*
-	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
-	 * TDH.MEM.PAGE.ADD().
-	 */
-	if (atomic64_read(&kvm_tdx->nr_premapped))
-		return -EINVAL;
 
 	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
 	if (tdx_operand_busy(cmd->hw_error))
@@ -3141,6 +3113,9 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	struct page *src_page;
 	int ret, i;
 
+	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
+		return -EIO;
+
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
 	 * source page has been swapped out or unmapped in primary memory.
@@ -3151,19 +3126,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret != 1)
 		return -ENOMEM;
 
+	kvm_tdx->page_add_src = src_page;
 	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
-	if (ret < 0)
-		goto out;
+	kvm_tdx->page_add_src = NULL;
 
-	ret = 0;
-	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
-			       src_page, &entry, &level_state);
-	if (err) {
-		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
-		goto out;
-	}
+	put_page(src_page);
 
-	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
+	if (ret)
+		return ret;
 
 	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
 		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
@@ -3176,8 +3146,6 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 		}
 	}
 
-out:
-	put_page(src_page);
 	return ret;
 }
 
arch/x86/kvm/vmx/tdx.h

Lines changed: 6 additions & 2 deletions
@@ -36,8 +36,12 @@ struct kvm_tdx {
 
 	struct tdx_td td;
 
-	/* For KVM_TDX_INIT_MEM_REGION. */
-	atomic64_t nr_premapped;
+	/*
+	 * Scratch pointer used to pass the source page to tdx_mem_page_add().
+	 * Protected by slots_lock, and non-NULL only when mapping a private
+	 * pfn via tdx_gmem_post_populate().
+	 */
+	struct page *page_add_src;
 
 	/*
 	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
