Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[crmp] Check IsAckPendingBefore send ACK #7740

Conversation

erjiaqing
Copy link
Contributor

@erjiaqing erjiaqing commented Jun 18, 2021

Problem

In #7634, the test is failling

  • After go through the log, I found
2021-06-18T05:25:48.5815802Z [1623993948.179715][9747] CHIP:EM: Sending Standalone Ack for MsgId:0000009D
2021-06-18T05:25:48.5816327Z [1623993948.179726][9747] CHIP:IN: Secure message was encrypted: Msg ID 80
2021-06-18T05:25:48.5817018Z [1623993948.179729][9747] CHIP:IN: Encrypted message 0x7ffc7f445e70 from 0x0000000000BC5C01 to 0x000000000001B669 of type 16 and protocolId 0 on exchange 39093.
2021-06-18T05:25:48.5817750Z [1623993948.179733][9747] CHIP:IN: Sending msg 0x7ffc7f445e70 to 0x000000000001B669 at utc time: 523585 msec
2021-06-18T05:25:48.5818303Z [1623993948.179737][9747] CHIP:IN: Sending secure msg on generic transport
2021-06-18T05:25:48.5818813Z [1623993948.179755][9747] CHIP:IN: Secure msg send status No Error
2021-06-18T05:25:48.5819271Z [1623993948.179762][9747] CHIP:DMG: <RE:Run> Cluster 1295, Field 21 is dirty
2021-06-18T05:25:48.5819917Z [1623993948.179765][9747] CHIP:DMG: Received Cluster Command: Cluster=50f NodeId=0x0000000000BC5C01 Endpoint=1 FieldId=15 ListIndex=0
2021-06-18T05:25:48.5820563Z [1623993948.179771][9747] CHIP:DMG: <RE> Dumping report data...
2021-06-18T05:25:48.5820983Z [1623993948.179775][9747] CHIP:DMG: ReportData =
2021-06-18T05:25:48.5821349Z [1623993948.179778][9747] CHIP:DMG: {
2021-06-18T05:25:48.5821751Z [1623993948.179781][9747] CHIP:DMG: 	AttributeDataList =
2021-06-18T05:25:48.5822159Z [1623993948.179784][9747] CHIP:DMG: 	[
2021-06-18T05:25:48.5822595Z [1623993948.179787][9747] CHIP:DMG: 		AttributeDataElement =
2021-06-18T05:25:48.5823034Z [1623993948.179790][9747] CHIP:DMG: 		{
2021-06-18T05:25:48.5823409Z [1623993948.179793][9747] CHIP:DMG: 			AttributePath =
2021-06-18T05:25:48.5823789Z [1623993948.179796][9747] CHIP:DMG: 			{
2021-06-18T05:25:48.5824160Z [1623993948.179800][9747] CHIP:DMG: 				NodeId = 0xbc5c01,
2021-06-18T05:25:48.5824566Z [1623993948.179803][9747] CHIP:DMG: 				EndpointId = 0x1,
2021-06-18T05:25:48.5825006Z [1623993948.179806][9747] CHIP:DMG: 				ClusterId = 0x50f,
2021-06-18T05:25:48.5825405Z [1623993948.179809][9747] CHIP:DMG: 				FieldTag = 0x15,
2021-06-18T05:25:48.5825767Z [1623993948.179812][9747] CHIP:DMG: 			}
2021-06-18T05:25:48.5826078Z [1623993948.179816][9747] CHIP:DMG: 				
2021-06-18T05:25:48.5826415Z [1623993948.179819][9747] CHIP:DMG: 			Data = 0, 
2021-06-18T05:25:48.5826849Z [1623993948.179823][9747] CHIP:DMG: 			DataElementVersion = 0x0,
2021-06-18T05:25:48.5827274Z [1623993948.179825][9747] CHIP:DMG: 		},
2021-06-18T05:25:48.5827592Z [1623993948.179829][9747] CHIP:DMG: 		
2021-06-18T05:25:48.5827900Z [1623993948.179832][9747] CHIP:DMG: 	],
2021-06-18T05:25:48.5828217Z [1623993948.179836][9747] CHIP:DMG: 	
2021-06-18T05:25:48.5828520Z [1623993948.179839][9747] CHIP:DMG: }
2021-06-18T05:25:48.5828903Z [1623993948.179842][9747] CHIP:DMG: <RE> Sending report...
2021-06-18T05:25:48.5829374Z [1623993948.179845][9747] CHIP:EM: Piggybacking Ack for MsgId:0000009D with msg
2021-06-18T05:25:48.5829914Z [1623993948.179852][9747] CHIP:IN: Secure message was encrypted: Msg ID 81
2021-06-18T05:25:48.5830656Z [1623993948.179855][9747] CHIP:IN: Encrypted message 0x5568e89324c0 from 0x0000000000BC5C01 to 0x000000000001B669 of type 5 and protocolId 5 on exchange 39093.
2021-06-18T05:25:48.5831416Z [1623993948.179859][9747] CHIP:IN: Sending msg 0x5568e89324c0 to 0x000000000001B669 at utc time: 523586 msec
2021-06-18T05:25:48.5831968Z [1623993948.179862][9747] CHIP:IN: Sending secure msg on generic transport
2021-06-18T05:25:48.5832460Z [1623993948.179869][9747] CHIP:IN: Secure msg send status No Error
2021-06-18T05:25:48.5832955Z [1623993948.179875][9747] CHIP:DMG: IM RH moving to [Uninitialized]
2021-06-18T05:25:48.5833557Z [1623993948.179878][9747] CHIP:DMG: <RE> ReportsInFlight = 1 with readHandler 0, RE has no more messages
2021-06-18T05:25:48.5834208Z [1623993948.179881][9747] CHIP:DMG: <RE> OnReportConfirm: NumReports = 0
2021-06-18T05:25:48.5834751Z [1623993948.179884][9747] CHIP:DMG: IM RH moving to [Uninitialized]
2021-06-18T05:25:48.5835372Z [1623993948.179893][9747] CHIP:IN: Secure transport received message destined to fabric 0, node 0x0000000000BC5C01. Key ID 0
2021-06-18T05:25:48.5836085Z [1623993948.179899][9747] CHIP:EM: Received message of type 16 and protocolId 0 on exchange 39092
2021-06-18T05:25:48.5836774Z [1623993948.179946][9763] CHIP:IN: Secure transport received message destined to fabric 0, node 0x000000000001B669. Key ID 0
2021-06-18T05:25:48.5837477Z [1623993948.179959][9763] CHIP:EM: Received message of type 16 and protocolId 0 on exchange 39093
2021-06-18T05:25:48.5838058Z [1623993948.179963][9763] CHIP:EM: Rxd Ack; Removing MsgId:0000009D from Retrans Table
2021-06-18T05:25:48.5838616Z [1623993948.179965][9763] CHIP:EM: Removed CHIP MsgId:0000009D from RetransTable
2021-06-18T05:25:48.5839275Z [1623993948.179975][9763] CHIP:IN: Secure transport received message destined to fabric 0, node 0x000000000001B669. Key ID 0
2021-06-18T05:25:48.5839955Z [1623993948.179982][9763] CHIP:EM: Received message of type 5 and protocolId 5 on exchange 39093
2021-06-18T05:25:48.5840537Z [1623993948.179984][9763] CHIP:EM: CHIP MsgId:0000009D not in RetransTable
2021-06-18T05:26:00.1898549Z [1623993960.180145][9763] CHIP:DMG: Time out! failed to receive report data from Exchange: 39093

Note that:

2021-06-18T05:25:48.5815802Z [1623993948.179715][9747] CHIP:EM: Sending Standalone Ack for MsgId:0000009D
2021-06-18T05:25:48.5816327Z [1623993948.179726][9747] CHIP:IN: Secure message was encrypted: Msg ID 80

and

2021-06-18T05:25:48.5829374Z [1623993948.179845][9747] CHIP:EM: Piggybacking Ack for MsgId:0000009D with msg
2021-06-18T05:25:48.5829914Z [1623993948.179852][9747] CHIP:IN: Secure message was encrypted: Msg ID 81

The server sent two ACKs to the same message id, thus the client think this was a replay attack (?), and drop the second message.

2021-06-18T05:25:48.5836774Z [1623993948.179946][9763] CHIP:IN: Secure transport received message destined to fabric 0, node 0x000000000001B669. Key ID 0
2021-06-18T05:25:48.5837477Z [1623993948.179959][9763] CHIP:EM: Received message of type 16 and protocolId 0 on exchange 39093
2021-06-18T05:25:48.5838058Z [1623993948.179963][9763] CHIP:EM: Rxd Ack; Removing MsgId:0000009D from Retrans Table
2021-06-18T05:25:48.5838616Z [1623993948.179965][9763] CHIP:EM: Removed CHIP MsgId:0000009D from RetransTable
2021-06-18T05:25:48.5839275Z [1623993948.179975][9763] CHIP:IN: Secure transport received message destined to fabric 0, node 0x000000000001B669. Key ID 0
2021-06-18T05:25:48.5839955Z [1623993948.179982][9763] CHIP:EM: Received message of type 5 and protocolId 5 on exchange 39093
2021-06-18T05:25:48.5840537Z [1623993948.179984][9763] CHIP:EM: CHIP MsgId:0000009D not in RetransTable

Change overview

Adds a check in CRMP not to send ack flag if the ack was already sent.

Testing

Copy link
Contributor

@bzbarsky-apple bzbarsky-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test that exercises this? Specifically:

  1. A test that exercises the actual situation that's at issue here, which is one side sending a message and the other side taking long enough to process that it sends a standalone ack and then a response. The response should not get dropped.
  2. Ideally a test that would exercise the brokenness around HasPeerRequestedAck.

But also, it's not clear to me why we are completely stopping processing of a message that carries an ack to a message we don't think we want an ack for. That logic is not in the spec anywhere I can see (@turon am I just missing it there?) and we could end up in this situation simply by virtue of running out of CRMP retries for a message and then getting a response to it, with an ack, no?

@@ -136,6 +136,7 @@ CHIP_ERROR ReliableMessageContext::FlushAcks()
#if !defined(NDEBUG)
ChipLogDetail(ExchangeManager, "Flushed pending ack for MsgId:%08" PRIX32, mPendingPeerAckId);
#endif
SetAckPending(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just have SendStandaloneAckMessage do the SetAckPending(false) bit. That's what other callers of it assume anyway, except for ReliableMessageMgr::ExecuteActions, which could remove its manual call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if piggy back message does not do the SetAckPending(false), then a standalong ack message will be sent in the following MRP tick, right?

We have another issue since those flags might get manipulated from multiple threads, but we already have a separate thread to discuss the general racing issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if piggy back message does not do the SetAckPending(false),

But it does, right? That call is in ExchangeMessageDispatch::SendMessage where we piggyback.

So to be clearer: we should have SetAckPending(false) in piggyback where we have it now and in SendStandaloneAckMessage, which are the two places that can send acks. And we should not need it anywhere else. That means not in FlushAcks and not in ReliableMessageMgr::ExecuteActions.

// If there is a pending acknowledgment piggyback it on this message.
if (reliableMessageContext->HasPeerRequestedAck())
// If there is a pending acknowledgment piggyback it on this message, and we have not send ACK before.
if (reliableMessageContext->HasPeerRequestedAck() && reliableMessageContext->IsAckPending())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just be testing reliableMessageContext->IsAckPending(). See #7339

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just use one flag to indicate there is a pending message needs to be acked to prevent confusion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isAckPending should be the right thing, I agree.

Is it true that in SendMessage, we could just get rid of SetPeerRequestedAck and just call SetAckPending()? I think so. Then we could replace all calls to Set/HasPeerRequestedAck() completely, and just always use IsAckPending().

See snippet below:

CHIP_ERROR ExchangeMessageDispatch::OnMessageReceived(const PayloadHeader & payloadHeader, uint32_t messageId,
                                                      const Transport::PeerAddress & peerAddress,
                                                      ReliableMessageContext * reliableMessageContext)
{
    ReturnErrorCodeIf(!MessagePermitted(payloadHeader.GetProtocolID().GetProtocolId(), payloadHeader.GetMessageType()),
                      CHIP_ERROR_INVALID_ARGUMENT);

    if (IsReliableTransmissionAllowed())
    {
        if (payloadHeader.IsAckMsg() && payloadHeader.GetAckId().HasValue())
        {
            ReturnErrorOnFailure(reliableMessageContext->HandleRcvdAck(payloadHeader.GetAckId().Value()));
        }

        if (payloadHeader.NeedsAck())
        {
            MessageFlags msgFlags;

            // An acknowledgment needs to be sent back to the peer for this message on this exchange,
            // Set the flag in message header indicating an ack requested by peer;
            msgFlags.Set(MessageFlagValues::kPeerRequestedAck);

            // Also set the flag in the exchange context indicating an ack requested;
            //////////// HERE REPLACE
            // reliableMessageContext->SetPeerRequestedAck(true);
            reliableMessageContext->SetAckPending(true);            
            ///// END OF SUGGESTION

            ReturnErrorOnFailure(reliableMessageContext->HandleNeedsAck(messageId, msgFlags));
        }
    }

    return CHIP_NO_ERROR;
}

@erjiaqing
Copy link
Contributor Author

Can we add a test that exercises this? Specifically:

  1. A test that exercises the actual situation that's at issue here, which is one side sending a message and the other side taking long enough to process that it sends a standalone ack and then a response. The response should not get dropped.
  2. Ideally a test that would exercise the brokenness around HasPeerRequestedAck.

But also, it's not clear to me why we are completely stopping processing of a message that carries an ack to a message we don't think we want an ack for. That logic is not in the spec anywhere I can see (@turon am I just missing it there?) and we could end up in this situation simply by virtue of running out of CRMP retries for a message and then getting a response to it, with an ack, no?

I found this logic is a bit hard to test, since it is a race condition infact. I also raised an issue in the spec repo about this logic, I'm not sure if this case is a possible replay attack, need some discussions for this.

I will go through the CRMP tests to see where can be used to add some hooks for the test.

@yufengwangca Do you have some idea about this?

@github-actions
Copy link

Size increase report for "gn_qpg6100-example-build" from 01ee80e

File Section File VM
chip-qpg6100-lighting-example.out .text 16 16
Full report output
BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-qpg6100-lighting-example.out and ./pull_artifact/chip-qpg6100-lighting-example.out:

sections,vmsize,filesize
.debug_info,0,163
.debug_loc,0,148
.debug_line,0,74
.debug_ranges,0,32
.text,16,16
.debug_frame,0,8
.debug_str,0,-1
[Unmapped],0,-16

Comparing ./master_artifact/chip-qpg6100-lighting-example.out.map and ./pull_artifact/chip-qpg6100-lighting-example.out.map:

BLOAT EXECUTION FAILED WITH CODE 1:
bloaty: unknown file type for file './pull_artifact/chip-qpg6100-lighting-example.out.map'


@github-actions
Copy link

Size increase report for "nrfconnect-example-build" from 01ee80e

File Section File VM
chip-shell.elf text 24 24
chip-shell.elf device_handles -8 -8
chip-lock.elf text 28 28
chip-lock.elf device_handles 4 4
Full report output
BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-shell.elf and ./pull_artifact/chip-shell.elf:

sections,vmsize,filesize
.debug_info,0,163
.debug_loc,0,148
.debug_line,0,73
.debug_ranges,0,32
text,24,24
.debug_frame,0,8
device_handles,-8,-8

Comparing ./master_artifact/chip-lock.elf and ./pull_artifact/chip-lock.elf:

sections,vmsize,filesize
.debug_info,0,163
.debug_loc,0,144
.debug_line,0,77
.debug_ranges,0,32
text,28,28
.debug_frame,0,8
device_handles,4,4


@github-actions
Copy link

Size increase report for "esp32-example-build" from 01ee80e

File Section File VM
chip-all-clusters-app.elf .flash.text 28 28
Full report output
BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-all-clusters-app.elf and ./pull_artifact/chip-all-clusters-app.elf:

sections,vmsize,filesize
.debug_info,0,44
.debug_line,0,29
.flash.text,28,28
.debug_str,0,-1
.debug_ranges,0,-8
.debug_loc,0,-56

Comparing ./master_artifact/chip-pigweed-app.elf and ./pull_artifact/chip-pigweed-app.elf:

sections,vmsize,filesize


@@ -49,8 +49,8 @@ CHIP_ERROR ExchangeMessageDispatch::SendMessage(SecureSessionHandle session, uin
PayloadHeader payloadHeader;
payloadHeader.SetExchangeID(exchangeId).SetMessageType(protocol, type).SetInitiator(isInitiator);

// If there is a pending acknowledgment piggyback it on this message.
if (reliableMessageContext->HasPeerRequestedAck())
// If there is a pending acknowledgment piggyback it on this message, and we have not send ACK before.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is the MRP tick fire before the response message from the app, and it only clear the isAckPending flag, but not HasPeerRequestedAck flag. So the following response message still piggy back the ack for the same message which has already been acked by a standalong ack message

@turon
Copy link
Contributor

turon commented Jun 18, 2021

The server sent two ACKs to the same message id, thus the client think this was a replay attack (?), and drop the second message.

Regarding "the server sent two ACKs to the same message id", each should have a unique MessageCounter and the same AckMessageCounter. In other words, the message ids should not be the same unless the message is in fact a retransmission of the same exact message.

Regarding "the client think this was a replay attack", why would the client register these two separate messages as a replay attack? They are from the same source, but each has a unique MessageCounter. Ack processing and AckMessageCounter should never factor into replay protection. In fact, all the header fields used for Ack processing are in the encrypted portion of the message, so must not retroactively trigger rejection of an otherwise verified and accepted message.

@yufengwangca
Copy link
Contributor

Can we add a test that exercises this? Specifically:

  1. A test that exercises the actual situation that's at issue here, which is one side sending a message and the other side taking long enough to process that it sends a standalone ack and then a response. The response should not get dropped.
  2. Ideally a test that would exercise the brokenness around HasPeerRequestedAck.

But also, it's not clear to me why we are completely stopping processing of a message that carries an ack to a message we don't think we want an ack for. That logic is not in the spec anywhere I can see (@turon am I just missing it there?) and we could end up in this situation simply by virtue of running out of CRMP retries for a message and then getting a response to it, with an ack, no?

I found this logic is a bit hard to test, since it is a race condition infact. I also raised an issue in the spec repo about this logic, I'm not sure if this case is a possible replay attack, need some discussions for this.

I will go through the CRMP tests to see where can be used to add some hooks for the test.

@yufengwangca Do you have some idea about this?

To test this broken logic, we need to delay the echo response and allow the standalong ack kick first.

Copy link
Contributor

@woody-apple woody-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we have no new tests, or updated tests that confirm this new logic. Can we have tests added to confirm this works, and doesn't break in future?

@bzbarsky-apple
Copy link
Contributor

I found this logic is a bit hard to test, since it is a race condition infact

I agree that it's a pain... That's why I haven't put up the (simple) fix for #7339 yet: I have not had time to write the tests.

That said, src/messaging/tests/TestReliableMessageProtocol.cpp has some "sleep so we retransmit" bits already, though that may be kind of fragile if the timer thread does not wake up fast enough, I guess....

so must not retroactively trigger rejection of an otherwise verified and accepted message

Yes, agreed. We should fix that (separate pr?) in addition to whatever else we do here.

@erjiaqing
Copy link
Contributor Author

Created #7909

With offline discuession with @kghost , we might do some refactor of CRMP, close this PR for now.

@erjiaqing erjiaqing closed this Jun 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants