Skip to content

Commit 18ab3fc

Browse files
cai-fuqiangsean-jc
authored andcommitted
KVM: x86: Fix VM hard lockup after prolonged inactivity with periodic HV timer
When advancing the target expiration for the guest's APIC timer in periodic mode, set the expiration to "now" if the target expiration is in the past (similar to what is done in update_target_expiration()). Blindly adding the period to the previous target expiration can result in KVM generating a practically unbounded number of hrtimer IRQs due to programming an expired timer over and over. In extreme scenarios, e.g. if userspace pauses/suspends a VM for an extended duration, this can even cause hard lockups in the host. Currently, the bug only affects Intel CPUs when using the hypervisor timer (HV timer), a.k.a. the VMX preemption timer. Unlike the software timer, a.k.a. hrtimer, which KVM keeps running even on exits to userspace, the HV timer only runs while the guest is active. As a result, if the vCPU does not run for an extended duration, there will be a huge gap between the target expiration and the current time the vCPU resumes running. Because the target expiration is incremented by only one period on each timer expiration, this leads to a series of timer expirations occurring rapidly after the vCPU/VM resumes. More critically, when the vCPU first triggers a periodic HV timer expiration after resuming, advancing the expiration by only one period will result in a target expiration in the past. As a result, the delta may be calculated as a negative value. When the delta is converted into an absolute value (tscdeadline is an unsigned u64), the resulting value can overflow what the HV timer is capable of programming. I.e. the large value will exceed the VMX Preemption Timer's maximum bit width of cpu_preemption_timer_multi + 32, and thus cause KVM to switch from the HV timer to the software timer (hrtimers). After switching to the software timer, periodic timer expiration callbacks may be executed consecutively within a single clock interrupt handler, because hrtimers honors KVM's request for an expiration in the past and immediately re-invokes KVM's callback after reprogramming. And because the interrupt handler runs with IRQs disabled, restarting KVM's hrtimer over and over until the target expiration is advanced to "now" can result in a hard lockup. E.g. the following hard lockup was triggered in the host when running a Windows VM (only relevant because it used the APIC timer in periodic mode) after resuming the VM from a long suspend (in the host). NMI watchdog: Watchdog detected hard LOCKUP on cpu 45 ... RIP: 0010:advance_periodic_target_expiration+0x4d/0x80 [kvm] ... RSP: 0018:ff4f88f5d98d8ef0 EFLAGS: 00000046 RAX: fff0103f91be678e RBX: fff0103f91be678e RCX: 00843a7d9e127bcc RDX: 0000000000000002 RSI: 0052ca4003697505 RDI: ff440d5bfbdbd500 RBP: ff440d5956f99200 R08: ff2ff2a42deb6a84 R09: 000000000002a6c0 R10: 0122d794016332b3 R11: 0000000000000000 R12: ff440db1af39cfc0 R13: ff440db1af39cfc0 R14: ffffffffc0d4a560 R15: ff440db1af39d0f8 FS: 00007f04a6ffd700(0000) GS:ff440db1af380000(0000) knlGS:000000e38a3b8000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000d5651feff8 CR3: 000000684e038002 CR4: 0000000000773ee0 PKRU: 55555554 Call Trace: <IRQ> apic_timer_fn+0x31/0x50 [kvm] __hrtimer_run_queues+0x100/0x280 hrtimer_interrupt+0x100/0x210 ? ttwu_do_wakeup+0x19/0x160 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> Moreover, if the suspend duration of the virtual machine is not long enough to trigger a hard lockup in this scenario, since commit 98c25ea ("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86"), KVM will continue using the software timer until the guest reprograms the APIC timer in some way. Since the periodic timer does not require frequent APIC timer register programming, the guest may continue to use the software timer in perpetuity. Fixes: d8f2f49 ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode") Cc: stable@vger.kernel.org Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com> [sean: massage comments and changelog] Link: https://patch.msgid.link/20251113205114.1647493-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
1 parent 9633f18 commit 18ab3fc

File tree

1 file changed

+23
-5
lines changed

1 file changed

+23
-5
lines changed

arch/x86/kvm/lapic.c

Lines changed: 23 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -2131,15 +2131,33 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
21312131
ktime_t delta;
21322132

21332133
/*
2134-
* Synchronize both deadlines to the same time source or
2135-
* differences in the periods (caused by differences in the
2136-
* underlying clocks or numerical approximation errors) will
2137-
* cause the two to drift apart over time as the errors
2138-
* accumulate.
2134+
* Use kernel time as the time source for both the hrtimer deadline and
2135+
* TSC-based deadline so that they stay synchronized. Computing each
2136+
* deadline independently will cause the two deadlines to drift apart
2137+
* over time as differences in the periods accumulate, e.g. due to
2138+
* differences in the underlying clocks or numerical approximation errors.
21392139
*/
21402140
apic->lapic_timer.target_expiration =
21412141
ktime_add_ns(apic->lapic_timer.target_expiration,
21422142
apic->lapic_timer.period);
2143+
2144+
/*
2145+
* If the new expiration is in the past, e.g. because userspace stopped
2146+
* running the VM for an extended duration, then force the expiration
2147+
* to "now" and don't try to play catch-up with the missed events. KVM
2148+
* will only deliver a single interrupt regardless of how many events
2149+
* are pending, i.e. restarting the timer with an expiration in the
2150+
* past will do nothing more than waste host cycles, and can even lead
2151+
* to a hard lockup in extreme cases.
2152+
*/
2153+
if (ktime_before(apic->lapic_timer.target_expiration, now))
2154+
apic->lapic_timer.target_expiration = now;
2155+
2156+
/*
2157+
* Note, ensuring the expiration isn't in the past also prevents delta
2158+
* from going negative, which could cause the TSC deadline to become
2159+
* excessively large due to it an unsigned value.
2160+
*/
21432161
delta = ktime_sub(apic->lapic_timer.target_expiration, now);
21442162
apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
21452163
nsec_to_cycles(apic->vcpu, delta);

0 commit comments

Comments
 (0)