
Add message queue to transaction ingest engine #2035

Merged
merged 27 commits into master on Mar 10, 2022

Conversation

@jordanschalm (Member) commented Feb 19, 2022:

This PR adds a message queue to the ingest engine to avoid blocking network layer threads when processing transactions.

  • Uses ComponentManager in the ingest engine
  • Replaces ProcessLocal with a typed ProcessTransaction method for accepting transactions from the local source
  • Updates the ingress RPC server component:
    • Moves it to engine/collection/rpc, for consistency with the Access node RPC engine
    • Defines a typed Backend interface for the transaction-processor dependency (the ingest engine), rather than a generic network.Engine (see the sketch below)
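
A minimal sketch of what that typed dependency could look like (Backend and ProcessTransaction are named in the description above; the exact signatures are illustrative and may differ from the actual diff):

package rpc

import (
	"github.com/onflow/flow-go/model/flow"
)

// Backend is the typed transaction processor the RPC server depends on
// (implemented by the ingest engine), replacing the generic network.Engine.
type Backend interface {
	// ProcessTransaction accepts a transaction submitted via the local RPC
	// endpoint and queues it for processing.
	ProcessTransaction(tx *flow.TransactionBody) error
}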

@jordanschalm changed the title from "begin adding ingest msg queeu" to "Add message queue to transaction ingest ingest engine" Feb 19, 2022
@jordanschalm jordanschalm marked this pull request as ready for review February 22, 2022 22:13
@codecov-commenter commented Feb 22, 2022:

Codecov Report

Merging #2035 (4f30a46) into master (8e7dd3a) will decrease coverage by 0.02%.
The diff coverage is 43.90%.


@@            Coverage Diff             @@
##           master    #2035      +/-   ##
==========================================
- Coverage   56.99%   56.96%   -0.03%     
==========================================
  Files         635      635              
  Lines       36978    37038      +60     
==========================================
+ Hits        21076    21100      +24     
- Misses      13250    13278      +28     
- Partials     2652     2660       +8     
Flag Coverage Δ
unittests 56.96% <43.90%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
module/component/component.go 40.47% <ø> (ø)
engine/collection/rpc/engine.go 13.95% <7.14%> (ø)
engine/collection/ingest/engine.go 61.72% <48.14%> (-14.00%) ⬇️
engine/collection/ingest/config.go 100.00% <100.00%> (ø)
...s/hotstuff/votecollector/staking_vote_processor.go 89.47% <0.00%> (-3.51%) ⬇️
consensus/hotstuff/eventloop/event_loop.go 68.29% <0.00%> (-2.44%) ⬇️
fvm/handler/contract.go 75.32% <0.00%> (ø)
...sus/approvals/assignment_collector_statemachine.go 47.11% <0.00%> (+4.80%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jordanschalm changed the title from "Add message queue to transaction ingest ingest engine" to "Add message queue to transaction ingest engine" Feb 22, 2022
@zhangchiqing (Member) left a comment:

Looks good

engine/collection/ingest/engine.go (two outdated review threads, resolved)
Comment on lines +77 to +82
queue, err := fifoqueue.NewFifoQueue(
fifoqueue.WithCapacity(int(config.MaxMessageQueueSize)),
fifoqueue.WithLengthObserver(func(len int) {
mempoolMetrics.MempoolEntries(metrics.ResourceTransactionIngestQueue, uint(len))
}),
)
Contributor:

general question: Any reason we seem to prefer using fifoqueue in the codebase for this sort of thing? Aside from maybe the fact that it allows you to register a length observer?

I think in most cases buffered channels serve this purpose just as well, if not better (they allow multiple consumers to consume from the channel at once), and they are simpler because you don't have to manually trigger the Notifier.
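
For illustration, the buffered-channel alternative described here would look roughly like this (a sketch with made-up names, not code from this PR):

package main

import (
	"fmt"
	"sync"
)

func main() {
	// The buffered channel itself acts as the bounded message queue.
	messages := make(chan string, 100)

	// Multiple consumers can read from the same channel concurrently, and no
	// separate Notifier is needed: a receive simply blocks until a message is
	// available or the channel is closed.
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for msg := range messages {
				fmt.Printf("worker %d processed %s\n", worker, msg)
			}
		}(i)
	}

	// Producer side: a non-blocking send drops the message when the queue is
	// full, mirroring a capacity-limited fifoqueue.
	for i := 0; i < 10; i++ {
		select {
		case messages <- fmt.Sprintf("tx-%d", i):
		default:
			// queue full: drop the message (or bump a metric)
		}
	}

	close(messages)
	wg.Wait()
}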

Contributor:

The main benefit I see of using fifoqueue is that it could allow a potentially unbounded queue, meaning we wouldn't need to drop requests on reaching capacity.

However, we eliminate this benefit if we restrict the capacity using WithCapacity.

Member Author:

The length observer is one reason, and it is also more flexible in terms of using different message stores (e.g. a priority queue). We initially used buffered channels when first adding a per-engine message queue and ultimately decided against them: the implementation duplicated a lot of channel-management boilerplate in each engine, which was difficult to reason about if you're not already quite familiar with channels. Another alternative we talked about (maybe worth revisiting) is to have the MessageStore expose a channel for getting the next message to process, rather than using the notifier. The MessageStore then needs to maintain a goroutine to shovel messages into that channel, and to be startable/stoppable.
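
A rough sketch of that last alternative (all names here are hypothetical, not the actual MessageStore interface): the store keeps its own FIFO, runs a goroutine that shovels queued messages into an exposed channel, and is started/stopped via a context:

package queue

import (
	"context"
	"sync"
)

// message stands in for whatever the store holds.
type message struct{ payload string }

// channelStore exposes the next queued message on a channel, so consumers
// just read from Next() instead of waiting on a notifier.
type channelStore struct {
	mu   sync.Mutex
	fifo []message
	wake chan struct{} // capacity 1: "something was added"
	out  chan message
}

func newChannelStore() *channelStore {
	return &channelStore{
		wake: make(chan struct{}, 1),
		out:  make(chan message),
	}
}

// Put appends a message and wakes the shoveling goroutine if it is idle.
func (s *channelStore) Put(m message) {
	s.mu.Lock()
	s.fifo = append(s.fifo, m)
	s.mu.Unlock()
	select {
	case s.wake <- struct{}{}:
	default: // a wake-up is already pending
	}
}

// Start launches the goroutine that shovels queued messages into the exposed
// channel; it stops when ctx is cancelled.
func (s *channelStore) Start(ctx context.Context) {
	go func() {
		defer close(s.out)
		for {
			for {
				s.mu.Lock()
				if len(s.fifo) == 0 {
					s.mu.Unlock()
					break
				}
				m := s.fifo[0]
				s.fifo = s.fifo[1:]
				s.mu.Unlock()
				select {
				case s.out <- m:
				case <-ctx.Done():
					return
				}
			}
			select {
			case <-s.wake:
			case <-ctx.Done():
				return
			}
		}
	}()
}

// Next is the channel consumers select on for the next message.
func (s *channelStore) Next() <-chan message { return s.out }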

@synzhu (Contributor) commented Mar 7, 2022:

I remember a while back we discovered that the Notifier had a subtle bug. I don't remember the exact details, but basically if multiple messages are put into the queue concurrently, there's a race condition that could cause only a single notification to be emitted from the notifier. @AlexHentschel @durkmurder

So if we want to keep using this, we should either fix the Notifier or expose the channel as you suggested.

Member Author:

if multiple messages are put into the queue concurrently, there's a race condition that could cause only a single notification to be emitted from the notifier

This sounds like expected behaviour of the Notifier to me. If routine A and B attempt to Notify before routine C wakes to read the notification, one of the Notify calls from A or B will be a no-op and C will only observe one notification.
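
For reference, this coalescing behaviour is exactly what you get from a notifier built on a buffered channel of capacity one with a non-blocking send (a generic sketch; the implementation in this repo may differ in detail):

package queue

// Notifier delivers at most one pending "wake up" signal. Concurrent Notify
// calls made before the consumer reads coalesce into a single notification,
// which is why routine C above observes only one of the two signals.
type Notifier struct {
	ch chan struct{}
}

func NewNotifier() Notifier {
	return Notifier{ch: make(chan struct{}, 1)}
}

// Notify is non-blocking: if a signal is already pending, it is a no-op.
func (n Notifier) Notify() {
	select {
	case n.ch <- struct{}{}:
	default:
	}
}

// Channel returns the channel a consumer selects on to receive the signal.
func (n Notifier) Channel() <-chan struct{} {
	return n.ch
}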

Contributor:

This is not the problem; the problem is when there are multiple consumers, e.g. C1, C2, and C3. They could all be idle and waiting for a notification, but with the existing Notifier implementation it is possible for a producer to come along, insert 20 items into the queue, and call Notify 20 times, yet only one of the consumers would be woken up. This is undesirable: if there are 3 workers available, we obviously want to wake all of them.

@synzhu (Contributor) commented May 12, 2022:

Basically, the root of the issue with the existing Notifier implementation lies in the fact that in Go, all messages sent to a channel are delivered asynchronously, even when a consumer is already blocked on a read at the point when a send is initiated.

Therefore, in the scenario above, it's possible that when the producer sends notifications 2 to 20, notification 1 has still not been received by any receiver, even though all 3 receivers had already been blocked on a channel read prior to the first notification send. This results in notifications 2 to 20 being dropped on the floor.

Member Author:

You're right that one Notifier does not work to notify multiple consumers. It is intended to send a "wake" signal to a worker when some work is ready, not to send one signal per unit of work. That is why the consumer, when awoken, will process all the work available.

I think exposing a channel as the mechanism to read from a message queue is a better approach than what we have here.
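
The single-worker pattern described here looks roughly like this (a sketch with hypothetical names, capturing the intent rather than the exact worker loop in this PR):

package queue

import "context"

// processLoop blocks on the wake channel and, once woken, drains every
// message currently in the queue before going back to sleep. Because the
// whole backlog is drained per wake-up, a coalesced notification is
// sufficient for a single consumer.
func processLoop(ctx context.Context, wake <-chan struct{}, pop func() (interface{}, bool), handle func(interface{}) error) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-wake:
			for {
				msg, ok := pop()
				if !ok {
					break // queue empty, wait for the next notification
				}
				if err := handle(msg); err != nil {
					return err // unexpected errors propagate to the caller
				}
			}
		}
	}
}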

engine/collection/ingest/engine.go (outdated review thread, resolved)
@synzhu (Contributor) commented Feb 23, 2022:

Also, just tracking #2046 (comment) here, since it also needs to be updated in this PR.

  • Replace ProcessLocal with ProcessTransaction, check for shutdown signal in Process (has the same effect as checking context cancellation, which we already do)
  • this makes it consistent with the Access RPC engine, and more obvious that this component belongs to the access node
  • previously the logger was never initialized
  • we have determined the network thread blocking issue is not caused by this engine - therefore removing the additional debug logging
@jordanschalm jordanschalm removed the request for review from Kay-Zee March 4, 2022 16:56

Comment on lines 209 to 213
// onTransaction handles receipt of a new transaction. This can be submitted
// from outside the system or routed from another collection node.
func (e *Engine) onTransaction(originID flow.Identifier, tx *flow.TransactionBody) error {

txID := tx.ID()
defer e.engMetrics.MessageHandled(metrics.EngineCollectionIngest, metrics.MessageTransaction)
Contributor:

This function is enormous, would it be possible to split it up a bit?

Member Author:

Split up in b2d3de8

func (e *Engine) Process(channel network.Channel, originID flow.Identifier, event interface{}) error {
select {
case <-e.ComponentManager.ShutdownSignal():
return component.ErrComponentShutdown
Member:

Do we actually handle this sentinel anywhere? How about we just return nil in that case?

Member Author:

I would rather return an error than nil, since the message is not being processed. We shouldn't assume the caller knows the component is shut down - even if the error isn't handled, it at least gives visibility that the component is not in a healthy state (eg. in network layer logs, which is where this would currently surface).

If we return nil, and for some reason the node is not imminently shutting down, it's more difficult to see that there is a problem with this component.

Member:

I can't see where it will be useful. In future iterations, based on the FLIP, we won't return any error to the network layer since it doesn't know what to do with it.

Member Author:

What about ProcessTransaction for another example? If another component calls ProcessTransaction and the component has been shut down, should we return an error or nil? I think we should return an error, for the same reasons mentioned in the last comment.

(Currently we don't handle this at all and just process the transaction normally, which I'll fix.)

Member:

I would say in the case of ProcessTransaction it makes sense to return an error, since it's called from RPC and the error can be reported to the user.
Based on the FLIP, I am mainly concerned about Process, which is called directly from the network layer; as I said, the network layer has no clue what to do with an error, so it's better to omit it.

Member Author:

Updated in 2b6cd4f

err := e.processAvailableMessages(ctx)
if err != nil {
// if an error reaches this point, it is unexpected
ctx.Throw(err)
Member:

Generally speaking, Throw should terminate the current goroutine, but I would still add a return after it to be safe.

@@ -12,6 +12,9 @@ import (
"github.com/onflow/flow-go/module/util"
)

// ErrComponentShutdown is returned by a component which has already been shut down.
var ErrComponentShutdown = fmt.Errorf("component has already shut down")
Member:

Again, I don't believe we handle it anywhere and I don't see a practical usage for it. Even if we did handle it, we would most likely just ignore it either way?
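
(For context, if a caller did want to act on the sentinel, it would typically be matched with errors.Is; a hypothetical sketch, not code from this PR:)

package caller

import (
	"errors"
	"fmt"
)

// ErrComponentShutdown mirrors the sentinel introduced in this diff.
var ErrComponentShutdown = fmt.Errorf("component has already shut down")

// handleSubmitError shows how a caller could distinguish the shutdown case
// from other failures instead of silently ignoring the error.
func handleSubmitError(err error, log func(string)) {
	switch {
	case err == nil:
		return
	case errors.Is(err, ErrComponentShutdown):
		// The component is not healthy: surface that (e.g. in network-layer
		// logs or to the RPC caller) rather than dropping it.
		log("ingest engine is shut down; message not processed")
	default:
		log("unexpected error processing message: " + err.Error())
	}
}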

@durkmurder (Member) left a comment:

LGTM. Clean code, thank you. Added a few minor comments.
You did some refactoring of the RPC ingestion engine; maybe we should move it to ComponentManager as well? That could be done in a separate PR.

@jordanschalm jordanschalm merged commit 57f89d4 into master Mar 10, 2022
@jordanschalm jordanschalm deleted the jordan/6164-ln-ingest-msg-q branch March 10, 2022 15:00

6 participants