Ruby recycle and stall_and_wait message delivery errors #800
Describe the bug

I am implementing cache coherence protocols using Ruby and have encountered issues with both the "recycle" and "stall_and_wait" functions that cause fatal protocol errors. I have tried both recycle and stall_and_wait to handle messages that the coherence protocol cannot currently process (referred to as "deferred messages" below). I describe the individual issues below.

recycle: recycle moves the deferred request from the head of the queue to the tail so that other messages can be processed immediately. I have encountered a case where this reordered deferred messages on the same message channel (for example, when there were two messages on the same network). When the receiving cache is later able to observe the deferred messages, it observes them out of order, so the coherence protocol believes the later message was sent before the earlier one. This causes a protocol deadlock in Ruby.

stall_and_wait: stall_and_wait moves the deferred request into a separate structure tagged by an address. I have found that if the coherence protocol attempts to defer multiple messages using stall_and_wait, the earlier deferred message is overwritten by later calls to stall_and_wait. The earlier message then disappears entirely, causing obvious issues in the coherence protocol.

Affects version

Version 22.1.0.0

gem5 Modifications

I have implemented a custom coherence protocol using Ruby.

To Reproduce

Unfortunately, I am not sure how to reproduce these bugs without our coherence protocol. However, based on the source code, I believe these errors are inherent in the design of these mechanisms and can arise whenever multiple messages arrive on the same channel and are deferred.

Terminal Output, Expected behavior, Host Operating System, Host ISA, Compiler used, Additional information: not provided.
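The reordering described for recycle can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyQueue`, `observe_order_with_recycle`, and the message names are hypothetical and are not gem5's MessageBuffer API. It assumes the head message is deferred once and that the blocking condition clears before the next message is examined.

```cpp
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-in for a Ruby message queue; NOT gem5's MessageBuffer.
struct ToyQueue {
    std::deque<std::string> q;

    // Model of "recycle": move the head message to the tail.
    void recycle() {
        q.push_back(q.front());
        q.pop_front();
    }
};

// Returns the order in which a receiver would observe two messages sent
// as msg1 then msg2 on the same channel, if msg1 is recycled once while
// the receiver is blocked and the block clears right afterwards.
std::vector<std::string> observe_order_with_recycle() {
    ToyQueue buf;
    buf.q.push_back("msg1");  // sent first
    buf.q.push_back("msg2");  // sent second

    bool blocked = true;      // assumption: receiver cannot handle msg1 yet
    if (blocked) {
        buf.recycle();        // msg1 goes behind msg2
    }
    blocked = false;          // the blocking condition clears

    std::vector<std::string> observed;
    while (!buf.q.empty()) {
        observed.push_back(buf.q.front());
        buf.q.pop_front();
    }
    return observed;
}
```

In this model the receiver sees msg2 before msg1, which is the kind of reordering reported above. A protocol that relies on in-order delivery within one channel would need to tolerate this or avoid recycling.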
Replies: 1 comment 5 replies
Thanks for bringing this to the attention of the gem5 community! I think that most of the behavior you're seeing is expected (though there may be a bug if messages are overwritten in the stall queue). So, until we have more evidence of a bug (e.g., a reproducible error or some specific code that could be the culprit), I'm going to convert this to a discussion. I've included some further thoughts below.
On recycle reordering messages: this is expected behavior in Ruby. In fact, it is generally best practice for your protocol to handle messages that arrive out of order. By default, the network also allows messages to appear out of order (e.g., when a lower-priority virtual channel is more congested than a higher-priority virtual channel).
On stall_and_wait overwriting messages: this could be a bug. As far as I know, when you use … That said, if you use …
The code in question
I have to say, this code is incredibly confusing! Here's the best understanding I have.
m_waiting_buffers is a std::map<Addr, std::vector<MessageBuffer*>* >
So, it's "just" a map from addresses to a set of buffer pointers we need to…
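To make the shape of that container concrete, here is a minimal sketch of an address-keyed map of buffer pointers. It is an illustration only and does not reproduce gem5's logic: `ToyBuffer`, `WaitingMap`, `add_waiting`, and `take_waiting` are hypothetical names, `Addr` is assumed to be a 64-bit integer, and the vector is stored by value rather than by pointer for simplicity.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Addr = std::uint64_t;    // assumption: an address is a 64-bit integer
struct ToyBuffer { int id; };  // hypothetical stand-in for MessageBuffer

// Address -> buffers that have something waiting on that address.
// (The real member stores a pointer to the vector; a value is used here.)
using WaitingMap = std::map<Addr, std::vector<ToyBuffer*>>;

// Record that `buf` is waiting on `addr`. operator[] creates an empty
// vector on first use, so later registrations append instead of replacing.
void add_waiting(WaitingMap& m, Addr addr, ToyBuffer* buf) {
    m[addr].push_back(buf);
}

// Remove and return every buffer waiting on `addr`, in registration order.
std::vector<ToyBuffer*> take_waiting(WaitingMap& m, Addr addr) {
    std::vector<ToyBuffer*> out;
    auto it = m.find(addr);
    if (it != m.end()) {
        out = std::move(it->second);
        m.erase(it);
    }
    return out;
}
```

In this toy version, two registrations under the same address are both retained. Whether the real stall path appends or replaces is exactly what needs to be checked in the code in question.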