Skip to content

Conversation

jogme
Copy link
Contributor

@jogme jogme commented Aug 7, 2025

This is a backport PR for 3.5 from #28115

Checklist
  • documentation is added or updated
  • tests are added or updated

jogme and others added 3 commits August 7, 2025 10:38
Signed-off-by: Norbert Pocs <norbertp@openssl.org>
When HRR happens a second client hello is sent and it consist of a
transport params extension. This must be processed and not cause
failure.

Fixes: openssl/project#1296

Signed-off-by: Norbert Pocs <norbertp@openssl.org>
Recently, our overnight QUIC interop runs began failing in CI when an
openssl server was tested against an ngtcp2 client:
https://github.com/openssl/openssl/actions/runs/16739736813

The underlying cause bears some explination for historical purposes

The problem began happening with a recent update to ngtcp2 in which
ngtcp2 updated its wolfssl tls backend to support ML-KEM, which caused
ngtcp to emit a client hello message that offered several groups
(including X25519MLKEM768) but only provided a keyshare for x25519.
This in turn triggered the openssl server to respond with a hello retry
request (HRR), requesting an ML-KEM keyshare instead, which ngtcp2
obliged. However all subsequent frames from the client were discarded by
the server, due to failing packet body decryption.

The problem was tracked down to a mismatch in the initial vectors used
by the client and server, leading to an AEAD tag mismatch.

Packet protection keys generate their IV's in QUIC by xoring the packet
number of the received frame with the base IV as derived via HKDF in the
tls layer.

The underlying problem was that openssl hit a very odd corner case with
how we compute the packet number of the received frame.  To save space,
QUIC encodes packet numbers using a variable length integer, and only
sends the changed bits in the packet number.  This requires that the
receiver (openssl) store the largest received pn of the connection,
which we nominally do.

However, in default_port_packet_handler (where QUIC frames are processed
prior to having an established channel allocated) we use a temporary qrx
to validate the packet protection of those frames.  This temporary qrx
may be incorporated into the channel in some cases, but is not in the
case of a valid frame that generates an HRR at the TLS layer.  In this
case, the channel allocates its own qrx independently.  When this
occurs, the largest_pn value of the temporary qrx is lost, and
subsequent frames are unable to be received, as the newly allocated qrx
belives that the larges_pn for a given pn_space is 0, rather than the
value received in the initial frame (which was a complete 32 bit value,
rather than just the changed lower 8 bits).  As a result the IV
construction produced the wrong value, and the decrypt failed on those
subsequent frames.

Up to this point, that wasn't even a problem, as most quic
implementations start their packet numbering at 0, so the next packet
could still have its packet number computed properly.  The combination
of ngtcp using large random values for initial packet numbers, along
with the HRR triggering a separate qrx creation on a channel led to the
discovery of this discrepancy.

The fix seems pretty straightforward.  When we detect in
port_default_packet_handler, that we have a separate qrx in the new
channel, we migrate processed packets from the temporary qrx to the
canonical channel qrx.  In addition to doing that, we also need to
migrate the largest_pn array from the temporary qrx to the channel_qrx
so that subsequent frame reception is guaranteed to compute the received
frame packet number properly, and as such, compute the proper IV for
packet protection decryption.

Fixes openssl/project#1296
@jogme jogme requested a review from quarckster as a code owner August 7, 2025 08:40
@jogme jogme changed the base branch from master to openssl-3.5 August 7, 2025 08:40
@jogme jogme closed this Aug 7, 2025
@jogme jogme reopened this Aug 7, 2025
Copy link
Contributor

@Sashan Sashan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me. thanks.

@nhorman nhorman added approval: done This pull request has the required number of approvals branch: 3.5 Merge to openssl-3.5 labels Aug 7, 2025
@t8m t8m added approval: review pending This pull request needs review by a committer triaged: bug The issue/pr is/fixes a bug tests: exempted The PR is exempt from requirements for testing tests: present The PR has suitable tests present and removed tests: exempted The PR is exempt from requirements for testing approval: review pending This pull request needs review by a committer labels Aug 8, 2025
@nhorman nhorman moved this to Waiting Merge in Development Board Aug 8, 2025
@openssl-machine openssl-machine added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels Aug 8, 2025
@openssl-machine
Copy link
Collaborator

This pull request is ready to merge

@nhorman
Copy link
Contributor

nhorman commented Aug 8, 2025

merged to 3.5, thank you!

@nhorman nhorman closed this Aug 8, 2025
@github-project-automation github-project-automation bot moved this from Waiting Merge to Done in Development Board Aug 8, 2025
openssl-machine pushed a commit that referenced this pull request Aug 8, 2025
Signed-off-by: Norbert Pocs <norbertp@openssl.org>

Reviewed-by: Saša Nedvědický <sashan@openssl.org>
Reviewed-by: Neil Horman <nhorman@openssl.org>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #28189)
openssl-machine pushed a commit that referenced this pull request Aug 8, 2025
When HRR happens a second client hello is sent and it consist of a
transport params extension. This must be processed and not cause
failure.

Fixes: openssl/project#1296

Signed-off-by: Norbert Pocs <norbertp@openssl.org>

Reviewed-by: Saša Nedvědický <sashan@openssl.org>
Reviewed-by: Neil Horman <nhorman@openssl.org>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #28189)
openssl-machine pushed a commit that referenced this pull request Aug 8, 2025
Recently, our overnight QUIC interop runs began failing in CI when an
openssl server was tested against an ngtcp2 client:
https://github.com/openssl/openssl/actions/runs/16739736813

The underlying cause bears some explination for historical purposes

The problem began happening with a recent update to ngtcp2 in which
ngtcp2 updated its wolfssl tls backend to support ML-KEM, which caused
ngtcp to emit a client hello message that offered several groups
(including X25519MLKEM768) but only provided a keyshare for x25519.
This in turn triggered the openssl server to respond with a hello retry
request (HRR), requesting an ML-KEM keyshare instead, which ngtcp2
obliged. However all subsequent frames from the client were discarded by
the server, due to failing packet body decryption.

The problem was tracked down to a mismatch in the initial vectors used
by the client and server, leading to an AEAD tag mismatch.

Packet protection keys generate their IV's in QUIC by xoring the packet
number of the received frame with the base IV as derived via HKDF in the
tls layer.

The underlying problem was that openssl hit a very odd corner case with
how we compute the packet number of the received frame.  To save space,
QUIC encodes packet numbers using a variable length integer, and only
sends the changed bits in the packet number.  This requires that the
receiver (openssl) store the largest received pn of the connection,
which we nominally do.

However, in default_port_packet_handler (where QUIC frames are processed
prior to having an established channel allocated) we use a temporary qrx
to validate the packet protection of those frames.  This temporary qrx
may be incorporated into the channel in some cases, but is not in the
case of a valid frame that generates an HRR at the TLS layer.  In this
case, the channel allocates its own qrx independently.  When this
occurs, the largest_pn value of the temporary qrx is lost, and
subsequent frames are unable to be received, as the newly allocated qrx
belives that the larges_pn for a given pn_space is 0, rather than the
value received in the initial frame (which was a complete 32 bit value,
rather than just the changed lower 8 bits).  As a result the IV
construction produced the wrong value, and the decrypt failed on those
subsequent frames.

Up to this point, that wasn't even a problem, as most quic
implementations start their packet numbering at 0, so the next packet
could still have its packet number computed properly.  The combination
of ngtcp using large random values for initial packet numbers, along
with the HRR triggering a separate qrx creation on a channel led to the
discovery of this discrepancy.

The fix seems pretty straightforward.  When we detect in
port_default_packet_handler, that we have a separate qrx in the new
channel, we migrate processed packets from the temporary qrx to the
canonical channel qrx.  In addition to doing that, we also need to
migrate the largest_pn array from the temporary qrx to the channel_qrx
so that subsequent frame reception is guaranteed to compute the received
frame packet number properly, and as such, compute the proper IV for
packet protection decryption.

Fixes openssl/project#1296

Reviewed-by: Saša Nedvědický <sashan@openssl.org>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #28189)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval: ready to merge The 24 hour grace period has passed, ready to merge branch: 3.5 Merge to openssl-3.5 tests: present The PR has suitable tests present triaged: bug The issue/pr is/fixes a bug

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants