Conversation

Contributor
@peterbroadhurst commented Mar 4, 2022

Resolves #506

A few notes on the implementation.

Improvements related to performance:

  • The hash on a batch is now just a hash of the manifest, rather than the full payload (see the sketch after this list)
    • tx was added to the manifest to include in the hash
  • The database object for a batch is now the manifest
  • The manifest has been updated to include everything the batch aggregator needs to find pins
    • A count of the topics was needed for this
  • We now have a cache for messages + all data associated with a message
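
As an illustration of the manifest-hashing point in the first bullet, here is a minimal sketch; the struct shapes and field names are assumptions for illustration, not the actual FireFly types:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Illustrative manifest shape only - the real manifest carries more detail.
type ManifestEntry struct {
	ID   string `json:"id"`
	Hash string `json:"hash"`
}

type BatchManifest struct {
	Version  int             `json:"version"` // distinguishes old/new persisted formats
	TX       string          `json:"tx"`      // tx is in the manifest so it is covered by the hash
	Messages []ManifestEntry `json:"messages"`
	Data     []ManifestEntry `json:"data"`
}

// The batch hash is computed over the serialized manifest, not the full payload.
func manifestHash(m *BatchManifest) (string, error) {
	b, err := json.Marshal(m)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha256.Sum256(b)), nil
}

func main() {
	m := &BatchManifest{Version: 1, TX: "tx-1", Messages: []ManifestEntry{{ID: "msg-1", Hash: "..."}}}
	h, _ := manifestHash(m)
	fmt.Println(h)
}
```

The hash then covers only the compact manifest (with tx included), rather than the full batch payload.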

Improvements related to debug:

  • Added to-string helpers to definition batch actions, and log the results
  • Added a GET /status/pins collection to peek inside the pins status (see the example after this list)
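
As referenced above, a quick way to peek at the new collection from a Go client; the local address and /api/v1 prefix are assumptions about a default single-node setup, not something this change mandates:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumed local FireFly API listener and prefix - adjust for your deployment.
	resp, err := http.Get("http://localhost:5000/api/v1/status/pins")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON listing of pins, intended for problem diagnosis
}
```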

Migration:

  • The code copes with a persisted batch of the old type stored in the DB
    • A version in the manifest is used to distinguish this, and to provide future extensibility
  • The code copes with processing a batch that has a payload hash, rather than a manifest hash (see the sketch after this list)
    • This handles late-join/re-sync to a network that is processing old broadcasts
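
A minimal sketch of the hash fallback referenced in the list above; the function and variable names are illustrative, not the actual FireFly aggregator code:

```go
package main

import "fmt"

// Batches pinned before this change carry a hash of the full payload; new batches
// carry a hash of the manifest. Accepting either lets a node that late-joins or
// re-syncs a network still process old broadcasts.
func batchHashMatches(onChainHash, manifestHash, legacyPayloadHash string) bool {
	if onChainHash == manifestHash {
		return true // new-style batch: hash of the manifest
	}
	return onChainHash == legacyPayloadHash // old-style batch: hash of the full payload
}

func main() {
	fmt.Println(batchHashMatches("abc", "abc", "def")) // true - manifest hash matches
	fmt.Println(batchHashMatches("def", "abc", "def")) // true - legacy payload hash matches
}
```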

Potential follow-on work:

Message/data cache implementation notes

Messages have fields that are mutable, in two categories:

  1. Fields that can change multiple times, like state - you cannot rely on the cache for these
  2. Fields that go from being un-set to being set, and once set are immutable

For (2) the cache provides a set of CacheReadOption modifiers that make it safe to query the cache, even if the cache were slow to update asynchronously (an active/active cluster being the ultimate example here, but from code inspection this is possible in the current cache).

If you use CRORequestBatchID, then the cache will return a miss if there is no BatchID set.

If you use CRORequirePins, then the cache will return a miss if the number of pins does not match the number of topics in the message.
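
To make those miss conditions concrete, here is a minimal sketch of how such read options could behave; the Message shape and the satisfies helper are assumptions for illustration, not the actual FireFly cache API:

```go
package main

import "fmt"

// Illustrative message shape - only the fields the read options care about.
type Message struct {
	BatchID string
	Topics  []string
	Pins    []string
}

type CacheReadOption int

const (
	CRORequestBatchID CacheReadOption = iota // miss unless the BatchID has been set
	CRORequirePins                           // miss unless every topic has a pin
)

// satisfies returns false (a cache miss) if the cached copy might be behind the
// database for a field the caller needs; the caller then falls back to a DB read.
func satisfies(msg *Message, opts ...CacheReadOption) bool {
	for _, opt := range opts {
		switch opt {
		case CRORequestBatchID:
			if msg.BatchID == "" {
				return false
			}
		case CRORequirePins:
			if len(msg.Pins) != len(msg.Topics) {
				return false
			}
		}
	}
	return true
}

func main() {
	msg := &Message{Topics: []string{"t1"}}
	fmt.Println(satisfies(msg, CRORequirePins)) // false - pins not yet assigned, read from the DB
}
```

Because these fields, once set, are immutable, a cached copy that already satisfies the requested options can be trusted even if the cache is updated asynchronously; anything else simply falls back to the database.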

return nil
}
// For broadcast data the blob reference contains the "public" (shared storage) reference, which
// must have been allocated to this data item before sealing the batch.
Contributor
@awrichar Mar 4, 2022


This is the operative decision for #568 - I'm still wondering if it's possible for the public reference to be removed from the batch hash altogether. Perhaps it could be stored on the Blob instead of on the BlobRef?

Contributor


It's mainly odd that private blob transfer happens after batch sealing, but public blob transfer must happen before batch sealing in order to include the IPFS ref. Seems like they should happen in the same order regardless.

When a node receives an IPFS ref, it does do some checking of the fetched contents to verify they match the blob hash before recording the blob as received. The question is whether this is "good enough" to say we can send IPFS refs without including them in the batch's hash proof.
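
As a rough illustration of the check described here (the names and hashing details are assumptions, not FireFly's actual download path): the fetched shared-storage content is hashed and compared to the expected blob hash before the blob is recorded as received.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
)

// verifyFetchedBlob hashes content fetched from shared storage (e.g. via an IPFS ref)
// and checks it against the expected blob hash before recording the blob as received.
func verifyFetchedBlob(fetched io.Reader, expectedHash string) (bool, error) {
	h := sha256.New()
	if _, err := io.Copy(h, fetched); err != nil {
		return false, err
	}
	return fmt.Sprintf("%x", h.Sum(nil)) == expectedHash, nil
}

func main() {
	content := []byte("example payload")
	expected := fmt.Sprintf("%x", sha256.Sum256(content))
	ok, _ := verifyFetchedBlob(bytes.NewReader(content), expected)
	fmt.Println(ok) // true - contents match the expected blob hash
}
```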

Contributor Author
@peterbroadhurst Mar 8, 2022


> Seems like they should happen in the same order regardless.

While I agree the discrepancy is annoying - I do not think this is possible, as IPFS does not have a "messaging" capability. It's just a storage system. So the blockchain has to be the messaging system in this case.

@codecov-commenter commented Mar 8, 2022

Codecov Report

Merging #582 (8952a4f) into main (e7c080f) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##              main      #582    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files          304       304            
  Lines        17544     17801   +257     
==========================================
+ Hits         17544     17801   +257     
Impacted Files Coverage Δ
internal/orchestrator/orchestrator.go 100.00% <ø> (ø)
internal/apiserver/route_get_batch_by_id.go 100.00% <100.00%> (ø)
internal/apiserver/route_get_batches.go 100.00% <100.00%> (ø)
internal/apiserver/route_get_data.go 100.00% <100.00%> (ø)
internal/apiserver/route_get_msg_data.go 100.00% <100.00%> (ø)
internal/apiserver/route_get_status_pins.go 100.00% <100.00%> (ø)
internal/batch/batch_manager.go 100.00% <100.00%> (ø)
internal/batch/batch_processor.go 100.00% <100.00%> (ø)
internal/batchpin/batchpin.go 100.00% <100.00%> (ø)
internal/batchpin/operations.go 100.00% <100.00%> (ø)
... and 37 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e7c080f...8952a4f.


var getStatusPins = &oapispec.Route{
Name: "getStatusPins",
Path: "status/pins",
Contributor


Why are these under status? Doesn't feel entirely like a "status" object to me since it's just a collection listing... I guess the alternative would be a root endpoint though. Open to anything really, just wanted to call it out.

Contributor Author


We could put them at the root, if we wanted to explain them in more detail as a first-class object.
My thinking here was that they are an internal read-only view of the state, for problem diagnosis. I understand it's not perfect, and I'm happy to discuss more.

Contributor Author


I'll be merging this through the chain as-is here, but with an openness to moving it in a future commit.

msgIDs[i] = msg.Header.ID
// We don't want to have to read the DB again if we want to query for the batch ID, or pins,
// so ensure the copy in our cache gets updated.
bp.data.UpdateMessageIfCached(ctx, msg)
Contributor
@awrichar Mar 10, 2022


Want to make sure it's OK that we update the batchID in the cache before we write the batchID to the database itself (i.e. in case we somehow fail to update the database)... I think it's OK, and we would come back around and try to write the same batchID on the second try. Just wanted to put a note here because I was staring at this for a bit.

Contributor Author


Yep, the only way to avoid this would be to push all of the cache updating (in all cases) to post-commit actions.
I convinced myself that wasn't required, but happy to discuss more.

for di, dataRef := range msg.Data {
msgData[di] = dataByID[*dataRef.ID]
if msgData[di] == nil || !msgData[di].Hash.Equals(dataRef.Hash) {
log.L(ctx).Debugf("Message '%s' in batch '%s' - data not in-line in batch id='%s' hash='%s'", msg.Header.ID, batch.ID, dataRef.ID, dataRef.Hash)
Contributor


Should this be higher than debug? Is this an expected situation?

Contributor Author


The architecture prior to this code change allows it.
For example, you could send some data to a party, then send a message referring to that data, without sending that data again.

Contributor Author


A specific example would be a broadcast, followed by a private message.

fftypes.OpTypeDataExchangeBatchSend)
op.Input = fftypes.JSONObject{
"batch": tw.Batch.ID,
}
Contributor


This is going to be overwritten by addBatchSendInputs below

Contributor Author


Sorry, I think this was a merge error.

Contributor
@awrichar left a comment


Looks good - marking approved, although I did leave a few comments inline that are worth a look before you decide to merge.

@peterbroadhurst merged commit be62be6 into hyperledger:main Mar 11, 2022
@peterbroadhurst deleted the batch-upgrade branch March 11, 2022 04:05

