[crmp] Check IsAckPendingBefore send ACK #7740

erjiaqing · 2021-06-18T07:50:28Z

Problem

In #7634, the test is failling

After go through the log, I found

2021-06-18T05:25:48.5815802Z [1623993948.179715][9747] CHIP:EM: Sending Standalone Ack for MsgId:0000009D
2021-06-18T05:25:48.5816327Z [1623993948.179726][9747] CHIP:IN: Secure message was encrypted: Msg ID 80
2021-06-18T05:25:48.5817018Z [1623993948.179729][9747] CHIP:IN: Encrypted message 0x7ffc7f445e70 from 0x0000000000BC5C01 to 0x000000000001B669 of type 16 and protocolId 0 on exchange 39093.
2021-06-18T05:25:48.5817750Z [1623993948.179733][9747] CHIP:IN: Sending msg 0x7ffc7f445e70 to 0x000000000001B669 at utc time: 523585 msec
2021-06-18T05:25:48.5818303Z [1623993948.179737][9747] CHIP:IN: Sending secure msg on generic transport
2021-06-18T05:25:48.5818813Z [1623993948.179755][9747] CHIP:IN: Secure msg send status No Error
2021-06-18T05:25:48.5819271Z [1623993948.179762][9747] CHIP:DMG: <RE:Run> Cluster 1295, Field 21 is dirty
2021-06-18T05:25:48.5819917Z [1623993948.179765][9747] CHIP:DMG: Received Cluster Command: Cluster=50f NodeId=0x0000000000BC5C01 Endpoint=1 FieldId=15 ListIndex=0
2021-06-18T05:25:48.5820563Z [1623993948.179771][9747] CHIP:DMG: <RE> Dumping report data...
2021-06-18T05:25:48.5820983Z [1623993948.179775][9747] CHIP:DMG: ReportData =
2021-06-18T05:25:48.5821349Z [1623993948.179778][9747] CHIP:DMG: {
2021-06-18T05:25:48.5821751Z [1623993948.179781][9747] CHIP:DMG: 	AttributeDataList =
2021-06-18T05:25:48.5822159Z [1623993948.179784][9747] CHIP:DMG: 	[
2021-06-18T05:25:48.5822595Z [1623993948.179787][9747] CHIP:DMG: 		AttributeDataElement =
2021-06-18T05:25:48.5823034Z [1623993948.179790][9747] CHIP:DMG: 		{
2021-06-18T05:25:48.5823409Z [1623993948.179793][9747] CHIP:DMG: 			AttributePath =
2021-06-18T05:25:48.5823789Z [1623993948.179796][9747] CHIP:DMG: 			{
2021-06-18T05:25:48.5824160Z [1623993948.179800][9747] CHIP:DMG: 				NodeId = 0xbc5c01,
2021-06-18T05:25:48.5824566Z [1623993948.179803][9747] CHIP:DMG: 				EndpointId = 0x1,
2021-06-18T05:25:48.5825006Z [1623993948.179806][9747] CHIP:DMG: 				ClusterId = 0x50f,
2021-06-18T05:25:48.5825405Z [1623993948.179809][9747] CHIP:DMG: 				FieldTag = 0x15,
2021-06-18T05:25:48.5825767Z [1623993948.179812][9747] CHIP:DMG: 			}
2021-06-18T05:25:48.5826078Z [1623993948.179816][9747] CHIP:DMG: 				
2021-06-18T05:25:48.5826415Z [1623993948.179819][9747] CHIP:DMG: 			Data = 0, 
2021-06-18T05:25:48.5826849Z [1623993948.179823][9747] CHIP:DMG: 			DataElementVersion = 0x0,
2021-06-18T05:25:48.5827274Z [1623993948.179825][9747] CHIP:DMG: 		},
2021-06-18T05:25:48.5827592Z [1623993948.179829][9747] CHIP:DMG: 		
2021-06-18T05:25:48.5827900Z [1623993948.179832][9747] CHIP:DMG: 	],
2021-06-18T05:25:48.5828217Z [1623993948.179836][9747] CHIP:DMG: 	
2021-06-18T05:25:48.5828520Z [1623993948.179839][9747] CHIP:DMG: }
2021-06-18T05:25:48.5828903Z [1623993948.179842][9747] CHIP:DMG: <RE> Sending report...
2021-06-18T05:25:48.5829374Z [1623993948.179845][9747] CHIP:EM: Piggybacking Ack for MsgId:0000009D with msg
2021-06-18T05:25:48.5829914Z [1623993948.179852][9747] CHIP:IN: Secure message was encrypted: Msg ID 81
2021-06-18T05:25:48.5830656Z [1623993948.179855][9747] CHIP:IN: Encrypted message 0x5568e89324c0 from 0x0000000000BC5C01 to 0x000000000001B669 of type 5 and protocolId 5 on exchange 39093.
2021-06-18T05:25:48.5831416Z [1623993948.179859][9747] CHIP:IN: Sending msg 0x5568e89324c0 to 0x000000000001B669 at utc time: 523586 msec
2021-06-18T05:25:48.5831968Z [1623993948.179862][9747] CHIP:IN: Sending secure msg on generic transport
2021-06-18T05:25:48.5832460Z [1623993948.179869][9747] CHIP:IN: Secure msg send status No Error
2021-06-18T05:25:48.5832955Z [1623993948.179875][9747] CHIP:DMG: IM RH moving to [Uninitialized]
2021-06-18T05:25:48.5833557Z [1623993948.179878][9747] CHIP:DMG: <RE> ReportsInFlight = 1 with readHandler 0, RE has no more messages
2021-06-18T05:25:48.5834208Z [1623993948.179881][9747] CHIP:DMG: <RE> OnReportConfirm: NumReports = 0
2021-06-18T05:25:48.5834751Z [1623993948.179884][9747] CHIP:DMG: IM RH moving to [Uninitialized]
2021-06-18T05:25:48.5835372Z [1623993948.179893][9747] CHIP:IN: Secure transport received message destined to fabric 0, node 0x0000000000BC5C01. Key ID 0
2021-06-18T05:25:48.5836085Z [1623993948.179899][9747] CHIP:EM: Received message of type 16 and protocolId 0 on exchange 39092
2021-06-18T05:25:48.5836774Z [1623993948.179946][9763] CHIP:IN: Secure transport received message destined to fabric 0, node 0x000000000001B669. Key ID 0
2021-06-18T05:25:48.5837477Z [1623993948.179959][9763] CHIP:EM: Received message of type 16 and protocolId 0 on exchange 39093
2021-06-18T05:25:48.5838058Z [1623993948.179963][9763] CHIP:EM: Rxd Ack; Removing MsgId:0000009D from Retrans Table
2021-06-18T05:25:48.5838616Z [1623993948.179965][9763] CHIP:EM: Removed CHIP MsgId:0000009D from RetransTable
2021-06-18T05:25:48.5839275Z [1623993948.179975][9763] CHIP:IN: Secure transport received message destined to fabric 0, node 0x000000000001B669. Key ID 0
2021-06-18T05:25:48.5839955Z [1623993948.179982][9763] CHIP:EM: Received message of type 5 and protocolId 5 on exchange 39093
2021-06-18T05:25:48.5840537Z [1623993948.179984][9763] CHIP:EM: CHIP MsgId:0000009D not in RetransTable
2021-06-18T05:26:00.1898549Z [1623993960.180145][9763] CHIP:DMG: Time out! failed to receive report data from Exchange: 39093

Note that:

2021-06-18T05:25:48.5815802Z [1623993948.179715][9747] CHIP:EM: Sending Standalone Ack for MsgId:0000009D
2021-06-18T05:25:48.5816327Z [1623993948.179726][9747] CHIP:IN: Secure message was encrypted: Msg ID 80

and

2021-06-18T05:25:48.5829374Z [1623993948.179845][9747] CHIP:EM: Piggybacking Ack for MsgId:0000009D with msg
2021-06-18T05:25:48.5829914Z [1623993948.179852][9747] CHIP:IN: Secure message was encrypted: Msg ID 81

The server sent two ACKs to the same message id, thus the client think this was a replay attack (?), and drop the second message.

2021-06-18T05:25:48.5836774Z [1623993948.179946][9763] CHIP:IN: Secure transport received message destined to fabric 0, node 0x000000000001B669. Key ID 0
2021-06-18T05:25:48.5837477Z [1623993948.179959][9763] CHIP:EM: Received message of type 16 and protocolId 0 on exchange 39093
2021-06-18T05:25:48.5838058Z [1623993948.179963][9763] CHIP:EM: Rxd Ack; Removing MsgId:0000009D from Retrans Table
2021-06-18T05:25:48.5838616Z [1623993948.179965][9763] CHIP:EM: Removed CHIP MsgId:0000009D from RetransTable
2021-06-18T05:25:48.5839275Z [1623993948.179975][9763] CHIP:IN: Secure transport received message destined to fabric 0, node 0x000000000001B669. Key ID 0
2021-06-18T05:25:48.5839955Z [1623993948.179982][9763] CHIP:EM: Received message of type 5 and protocolId 5 on exchange 39093
2021-06-18T05:25:48.5840537Z [1623993948.179984][9763] CHIP:EM: CHIP MsgId:0000009D not in RetransTable

Change overview

Adds a check in CRMP not to send ack flag if the ack was already sent.

Testing

Manually tested by cherry-picking this PR to [im] Land ReadAttribute using Interaction Model protocol #7634 , see:
https://github.com/erjiaqing/connectedhomeip/actions/runs/949064565

bzbarsky-apple

Can we add a test that exercises this? Specifically:

A test that exercises the actual situation that's at issue here, which is one side sending a message and the other side taking long enough to process that it sends a standalone ack and then a response. The response should not get dropped.
Ideally a test that would exercise the brokenness around HasPeerRequestedAck.

But also, it's not clear to me why we are completely stopping processing of a message that carries an ack to a message we don't think we want an ack for. That logic is not in the spec anywhere I can see (@turon am I just missing it there?) and we could end up in this situation simply by virtue of running out of CRMP retries for a message and then getting a response to it, with an ack, no?

bzbarsky-apple · 2021-06-18T07:57:33Z

src/messaging/ReliableMessageContext.cpp

@@ -136,6 +136,7 @@ CHIP_ERROR ReliableMessageContext::FlushAcks()
 #if !defined(NDEBUG)
            ChipLogDetail(ExchangeManager, "Flushed pending ack for MsgId:%08" PRIX32, mPendingPeerAckId);
 #endif
+            SetAckPending(false);


We should just have SendStandaloneAckMessage do the SetAckPending(false) bit. That's what other callers of it assume anyway, except for ReliableMessageMgr::ExecuteActions, which could remove its manual call.

if piggy back message does not do the SetAckPending(false), then a standalong ack message will be sent in the following MRP tick, right?

We have another issue since those flags might get manipulated from multiple threads, but we already have a separate thread to discuss the general racing issue.

if piggy back message does not do the SetAckPending(false),

But it does, right? That call is in ExchangeMessageDispatch::SendMessage where we piggyback.

So to be clearer: we should have SetAckPending(false) in piggyback where we have it now and in SendStandaloneAckMessage, which are the two places that can send acks. And we should not need it anywhere else. That means not in FlushAcks and not in ReliableMessageMgr::ExecuteActions.

bzbarsky-apple · 2021-06-18T07:57:59Z

src/messaging/ExchangeMessageDispatch.cpp

-    // If there is a pending acknowledgment piggyback it on this message.
-    if (reliableMessageContext->HasPeerRequestedAck())
+    // If there is a pending acknowledgment piggyback it on this message, and we have not send ACK before.
+    if (reliableMessageContext->HasPeerRequestedAck() && reliableMessageContext->IsAckPending())


This should just be testing reliableMessageContext->IsAckPending(). See #7339

Should we just use one flag to indicate there is a pending message needs to be acked to prevent confusion?

isAckPending should be the right thing, I agree.

Is it true that in SendMessage, we could just get rid of SetPeerRequestedAck and just call SetAckPending()? I think so. Then we could replace all calls to Set/HasPeerRequestedAck() completely, and just always use IsAckPending().

See snippet below:

CHIP_ERROR ExchangeMessageDispatch::OnMessageReceived(const PayloadHeader & payloadHeader, uint32_t messageId, const Transport::PeerAddress & peerAddress, ReliableMessageContext * reliableMessageContext) { ReturnErrorCodeIf(!MessagePermitted(payloadHeader.GetProtocolID().GetProtocolId(), payloadHeader.GetMessageType()), CHIP_ERROR_INVALID_ARGUMENT); if (IsReliableTransmissionAllowed()) { if (payloadHeader.IsAckMsg() && payloadHeader.GetAckId().HasValue()) { ReturnErrorOnFailure(reliableMessageContext->HandleRcvdAck(payloadHeader.GetAckId().Value())); } if (payloadHeader.NeedsAck()) { MessageFlags msgFlags; // An acknowledgment needs to be sent back to the peer for this message on this exchange, // Set the flag in message header indicating an ack requested by peer; msgFlags.Set(MessageFlagValues::kPeerRequestedAck); // Also set the flag in the exchange context indicating an ack requested; //////////// HERE REPLACE // reliableMessageContext->SetPeerRequestedAck(true); reliableMessageContext->SetAckPending(true); ///// END OF SUGGESTION ReturnErrorOnFailure(reliableMessageContext->HandleNeedsAck(messageId, msgFlags)); } } return CHIP_NO_ERROR; }

erjiaqing · 2021-06-18T08:08:46Z

Can we add a test that exercises this? Specifically:

A test that exercises the actual situation that's at issue here, which is one side sending a message and the other side taking long enough to process that it sends a standalone ack and then a response. The response should not get dropped.

Ideally a test that would exercise the brokenness around HasPeerRequestedAck.

But also, it's not clear to me why we are completely stopping processing of a message that carries an ack to a message we don't think we want an ack for. That logic is not in the spec anywhere I can see (@turon am I just missing it there?) and we could end up in this situation simply by virtue of running out of CRMP retries for a message and then getting a response to it, with an ack, no?

I found this logic is a bit hard to test, since it is a race condition infact. I also raised an issue in the spec repo about this logic, I'm not sure if this case is a possible replay attack, need some discussions for this.

I will go through the CRMP tests to see where can be used to add some hooks for the test.

@yufengwangca Do you have some idea about this?

github-actions · 2021-06-18T08:10:51Z

Size increase report for "gn_qpg6100-example-build" from 01ee80e

File	Section	File	VM
chip-qpg6100-lighting-example.out	.text	16	16

Full report output

BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-qpg6100-lighting-example.out and ./pull_artifact/chip-qpg6100-lighting-example.out:

sections,vmsize,filesize
.debug_info,0,163
.debug_loc,0,148
.debug_line,0,74
.debug_ranges,0,32
.text,16,16
.debug_frame,0,8
.debug_str,0,-1
[Unmapped],0,-16

Comparing ./master_artifact/chip-qpg6100-lighting-example.out.map and ./pull_artifact/chip-qpg6100-lighting-example.out.map:

BLOAT EXECUTION FAILED WITH CODE 1:
bloaty: unknown file type for file './pull_artifact/chip-qpg6100-lighting-example.out.map'

github-actions · 2021-06-18T08:39:18Z

Size increase report for "nrfconnect-example-build" from 01ee80e

File	Section	File	VM
chip-shell.elf	text	24	24
chip-shell.elf	device_handles	-8	-8
chip-lock.elf	text	28	28
chip-lock.elf	device_handles	4	4

Full report output

BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-shell.elf and ./pull_artifact/chip-shell.elf:

sections,vmsize,filesize
.debug_info,0,163
.debug_loc,0,148
.debug_line,0,73
.debug_ranges,0,32
text,24,24
.debug_frame,0,8
device_handles,-8,-8

Comparing ./master_artifact/chip-lock.elf and ./pull_artifact/chip-lock.elf:

sections,vmsize,filesize
.debug_info,0,163
.debug_loc,0,144
.debug_line,0,77
.debug_ranges,0,32
text,28,28
.debug_frame,0,8
device_handles,4,4

github-actions · 2021-06-18T08:39:34Z

Size increase report for "esp32-example-build" from 01ee80e

File	Section	File	VM
chip-all-clusters-app.elf	.flash.text	28	28

Full report output

BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-all-clusters-app.elf and ./pull_artifact/chip-all-clusters-app.elf:

sections,vmsize,filesize
.debug_info,0,44
.debug_line,0,29
.flash.text,28,28
.debug_str,0,-1
.debug_ranges,0,-8
.debug_loc,0,-56

Comparing ./master_artifact/chip-pigweed-app.elf and ./pull_artifact/chip-pigweed-app.elf:

sections,vmsize,filesize

yufengwangca · 2021-06-18T16:16:47Z

src/messaging/ExchangeMessageDispatch.cpp

@@ -49,8 +49,8 @@ CHIP_ERROR ExchangeMessageDispatch::SendMessage(SecureSessionHandle session, uin
    PayloadHeader payloadHeader;
    payloadHeader.SetExchangeID(exchangeId).SetMessageType(protocol, type).SetInitiator(isInitiator);

-    // If there is a pending acknowledgment piggyback it on this message.
-    if (reliableMessageContext->HasPeerRequestedAck())
+    // If there is a pending acknowledgment piggyback it on this message, and we have not send ACK before.


The problem here is the MRP tick fire before the response message from the app, and it only clear the isAckPending flag, but not HasPeerRequestedAck flag. So the following response message still piggy back the ack for the same message which has already been acked by a standalong ack message

turon · 2021-06-18T20:36:17Z

The server sent two ACKs to the same message id, thus the client think this was a replay attack (?), and drop the second message.

Regarding "the server sent two ACKs to the same message id", each should have a unique MessageCounter and the same AckMessageCounter. In other words, the message ids should not be the same unless the message is in fact a retransmission of the same exact message.

Regarding "the client think this was a replay attack", why would the client register these two separate messages as a replay attack? They are from the same source, but each has a unique MessageCounter. Ack processing and AckMessageCounter should never factor into replay protection. In fact, all the header fields used for Ack processing are in the encrypted portion of the message, so must not retroactively trigger rejection of an otherwise verified and accepted message.

yufengwangca · 2021-06-21T16:20:00Z

Can we add a test that exercises this? Specifically:

A test that exercises the actual situation that's at issue here, which is one side sending a message and the other side taking long enough to process that it sends a standalone ack and then a response. The response should not get dropped.

Ideally a test that would exercise the brokenness around HasPeerRequestedAck.

But also, it's not clear to me why we are completely stopping processing of a message that carries an ack to a message we don't think we want an ack for. That logic is not in the spec anywhere I can see (@turon am I just missing it there?) and we could end up in this situation simply by virtue of running out of CRMP retries for a message and then getting a response to it, with an ack, no?

I found this logic is a bit hard to test, since it is a race condition infact. I also raised an issue in the spec repo about this logic, I'm not sure if this case is a possible replay attack, need some discussions for this.

I will go through the CRMP tests to see where can be used to add some hooks for the test.

@yufengwangca Do you have some idea about this?

To test this broken logic, we need to delay the echo response and allow the standalong ack kick first.

woody-apple

Looks like we have no new tests, or updated tests that confirm this new logic. Can we have tests added to confirm this works, and doesn't break in future?

bzbarsky-apple · 2021-06-21T22:45:33Z

I found this logic is a bit hard to test, since it is a race condition infact

I agree that it's a pain... That's why I haven't put up the (simple) fix for #7339 yet: I have not had time to write the tests.

That said, src/messaging/tests/TestReliableMessageProtocol.cpp has some "sleep so we retransmit" bits already, though that may be kind of fragile if the timer thread does not wake up fast enough, I guess....

so must not retroactively trigger rejection of an otherwise verified and accepted message

Yes, agreed. We should fix that (separate pr?) in addition to whatever else we do here.

erjiaqing · 2021-06-25T10:25:26Z

Created #7909

With offline discuession with @kghost , we might do some refactor of CRMP, close this PR for now.

[crmp] Check IsAckPendingBefore send ACK

be2cc37

erjiaqing requested review from bzbarsky-apple, yufengwangca and pan-apple and removed request for bzbarsky-apple June 18, 2021 07:50

pullapprove bot requested review from andy31415, bzbarsky-apple, chrisdecenzo, Damian-Nordic, hawk248, jepenven-silabs and msandstedt June 18, 2021 07:50

pullapprove bot added the review - pending label Jun 18, 2021

bzbarsky-apple reviewed Jun 18, 2021

View reviewed changes

yufengwangca reviewed Jun 18, 2021

View reviewed changes

woody-apple requested changes Jun 21, 2021

View reviewed changes

erjiaqing closed this Jun 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[crmp] Check IsAckPendingBefore send ACK #7740

[crmp] Check IsAckPendingBefore send ACK #7740

erjiaqing commented Jun 18, 2021 •

edited

bzbarsky-apple left a comment

bzbarsky-apple Jun 18, 2021

yufengwangca Jun 18, 2021

bzbarsky-apple Jun 21, 2021

bzbarsky-apple Jun 18, 2021

yufengwangca Jun 18, 2021

tcarmelveilleux Jun 18, 2021 •

edited

erjiaqing commented Jun 18, 2021

github-actions bot commented Jun 18, 2021

github-actions bot commented Jun 18, 2021

github-actions bot commented Jun 18, 2021

yufengwangca Jun 18, 2021

turon commented Jun 18, 2021 •

edited

yufengwangca commented Jun 21, 2021

woody-apple left a comment

bzbarsky-apple commented Jun 21, 2021

erjiaqing commented Jun 25, 2021

[crmp] Check IsAckPendingBefore send ACK #7740

[crmp] Check IsAckPendingBefore send ACK #7740

Conversation

erjiaqing commented Jun 18, 2021 • edited

Problem

Change overview

Testing

bzbarsky-apple left a comment

Choose a reason for hiding this comment

bzbarsky-apple Jun 18, 2021

Choose a reason for hiding this comment

yufengwangca Jun 18, 2021

Choose a reason for hiding this comment

bzbarsky-apple Jun 21, 2021

Choose a reason for hiding this comment

bzbarsky-apple Jun 18, 2021

Choose a reason for hiding this comment

yufengwangca Jun 18, 2021

Choose a reason for hiding this comment

tcarmelveilleux Jun 18, 2021 • edited

Choose a reason for hiding this comment

erjiaqing commented Jun 18, 2021

github-actions bot commented Jun 18, 2021

github-actions bot commented Jun 18, 2021

github-actions bot commented Jun 18, 2021

yufengwangca Jun 18, 2021

Choose a reason for hiding this comment

turon commented Jun 18, 2021 • edited

yufengwangca commented Jun 21, 2021

woody-apple left a comment

Choose a reason for hiding this comment

bzbarsky-apple commented Jun 21, 2021

erjiaqing commented Jun 25, 2021

erjiaqing commented Jun 18, 2021 •

edited

tcarmelveilleux Jun 18, 2021 •

edited

turon commented Jun 18, 2021 •

edited