
Refactor TxPool, improve its robustness #1802

Merged · 44 commits · merged May 28, 2020

Conversation

andreibancioiu (Collaborator)

  1. Score computation logic extracted, to allow for easier changes / algorithm replacements
  2. Some components decoupled from cacheConfig etc.
  3. bucketSortedMap does not call computeScore anymore; it receives the score instead
  4. txListForSender does not compute its score; it only computes the senderScoreParams
  5. Items 3 and 4 allow us to reason more easily about critical sections within txListForSender
  6. Fixed a possible uint32 overflow in RemoveTxsBulk (the overflow would have been benign, since the value is only used to monitor / diagnose eviction)
  7. Wrapped transaction selection, addition and removal in try / catch blocks (see the sketch after this list)
  8. Removed the deprecated cache implementation
  9. Added a log with the number of senders with failed selection (per selection)
  10. Added a log with the number of transactions to remove when a block is finalized (sharded tx pool)
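
A minimal sketch of the recover()-based "try / catch" helper referenced in item 7. The name, signature and error message are assumptions; the actual helper lives in core/tryCatch.go and may differ.

package core

import "fmt"

// TryCatch runs "try" and, if it panics, converts the panic into an error and
// passes it to "catch". "context" names the wrapped operation for diagnostics.
func TryCatch(try func(), catch func(err error), context string) {
	defer func() {
		if r := recover(); r != nil {
			catch(fmt.Errorf("panic in %s: %v", context, r))
		}
	}()

	try()
}

Transaction selection, addition and removal would then be invoked through such a helper, so a panic inside the cache internals is reported instead of crashing the calling component.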

andreibancioiu and others added 6 commits May 17, 2020 21:54
# Conflicts:
#	dataRetriever/txpool/shardedTxPool.go
#	storage/txcache/config.go
#	storage/txcache/disabledCache.go
#	storage/txcache/txByHashMap.go
#	storage/txcache/txListBySenderMap_test.go
#	storage/txcache/txListForSender.go
#	storage/txcache/txListForSenderAsSortedMapItem_test.go
#	storage/txcache/txListForSender_test.go
#	storage/txcache/wrappedTransaction.go
core/tryCatch.go (review thread, resolved)
@@ -0,0 +1,65 @@
package preprocess
Contributor:

Why not create a sub-package called factory?
Why not split this file? It contains the factory instance and a disabled component; hard to figure this out just from reading the filename.

Collaborator (author):

As stated in item 8, this PR removes the deprecated sorting operations of the old transactions cache.

preprocess/transactions handles the TxCache using the Cacher interface. However, when it selects transactions at proposal time, it requires the selection operation, which the Cacher interface does not provide. This is why we needed the adapter in the first place. The adapter factory was switching between the new sortedTxsProvider and the old one - the old one was used if the cast cache.(*TxCache) failed (never the case in real life). Now, since we've removed the old components, we had to use a disabled substitute. Of course, this is suboptimal design. We actually have to fix transactions.go to depend on a well-defined interface instead, and discard the adapters and the adapter factory altogether. That is not doable in this PR, though; I've added a TODO for now. I will try to figure out another way in the next PR, pool++cross.

Furthermore, the factory and the adapters were kept in separate files before this PR. I grouped them together to increase the cohesion of smelly things that we should remove sometime soon.

Is this reasonable (that is, doing this non-trivial removal / change incrementally, to avoid larger changes in the code at once)?

Contributor:

It is reasonable as it is for now.
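
For illustration, a sketch of the factory logic described above. The adapter constructor and the disabled provider names are assumptions; only the cast on *TxCache comes from the discussion.

func createSortedTransactionsProvider(cache storage.Cacher) SortedTransactionsProvider {
	txCache, ok := cache.(*txcache.TxCache)
	if ok {
		// The real cache: expose the selection operation through the adapter.
		return newAdapterTxCacheToSortedTransactionsProvider(txCache)
	}

	// Unexpected cache type (should not happen in practice): fall back to a no-op substitute.
	return &disabledSortedTransactionsProvider{}
}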

Other review threads, resolved: storage/txcache/factory.go, storage/txcache/score.go, storage/txcache/txCacheFailsafe.go (four threads)
)

// createSortedTransactionsProvider is a "simple factory" for "SortedTransactionsProvider" objects
func createSortedTransactionsProvider(transactionsPreprocessor *transactions, cache storage.Cacher, cacheKey string) SortedTransactionsProvider {
Contributor:

transactionsPreprocessor and cacheKey are not used.

Collaborator (author):

Fixed, removed leftovers :)

listForSender.scoreChunkMutex.Lock()
listForSender.scoreChunk = scoreChunk
listForSender.scoreChunkMutex.Unlock()

func (computer *disabledScoreComputer) computeScore(scoreParams senderScoreParams) uint32 {
Contributor:

unused parameter, maybe use _

Collaborator (author):

Fixed.
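
The fix, sketched: the blank identifier documents that the parameter is intentionally unused (the zero return value is an assumption for the disabled computer).

func (computer *disabledScoreComputer) computeScore(_ senderScoreParams) uint32 {
	return 0
}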


for _, txHash := range txHashes {
txMap.removeTx(string(txHash))
_, removed := txMap.removeTx(string(txHash))
Contributor:

the removeTx implementation seems a bit weak; for example, if we concurrently try to remove the same txHash, we might end up with a wrong count and a wrong size (the removal succeeds only once, but we decrement the counter and the size twice)

andreibancioiu (Collaborator, author), May 27, 2020:

Very nice catch!

tx, ok := txMap.getTx(txHash) could return ok == true for both concurrent goroutines, and both would then decrement, which, indeed, is not correct.

The issue can appear if transactions are removed during eviction (on high load) and also removed (at the same time) on the "commit / finalize block" event.

In monitoring.go, in function diagnose() we have:

numTxsEstimate := int(cache.CountTx()) // this comes from the possibly wrong counter
numTxsInChunks := cache.txByHash.backingMap.Count()

If the estimate does not equal numTxsInChunks, a log.Debug() is emitted and we should see the difference. This diagnose() is unfortunately (currently) called only upon transaction selection (when it is the node's turn to propose).

A possible solution to this issue: in txMap.backingMap.Remove, when the map chunk is locked (under mutex), do the "item exists" check and call an onItemRemovedCallback() to decrement the atomic counters. Would this be fine?

Another solution would be to apply corrections from time to time - that would work fine for the counter, but the corrections for the size counter would be quite inefficient.

I've added a TODO for this correction (it is a pre-existing condition).

AdoAdoAdo (Contributor), May 27, 2020:

I was thinking the same: add the check in Remove, under lock, and have the function return a bool telling whether the removal was actually done.
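
A self-contained sketch of the agreed direction, with simplified stand-in types (the real cache uses a sharded map with per-chunk mutexes and atomic counters): the existence check and the counter updates happen under the same lock, and the method reports whether this call actually removed the entry, so a concurrent duplicate removal cannot decrement twice.

package main

import (
	"fmt"
	"sync"
)

type txByHashMapSketch struct {
	mutex    sync.Mutex
	items    map[string]int64 // txHash -> tx size, a stand-in for *WrappedTransaction
	numTxs   int64
	numBytes int64
}

// removeTx reports whether this call actually removed the entry; the counters
// are adjusted under the same lock, so the losing goroutine is a no-op.
func (m *txByHashMapSketch) removeTx(txHash string) bool {
	m.mutex.Lock()
	defer m.mutex.Unlock()

	size, ok := m.items[txHash]
	if !ok {
		return false
	}

	delete(m.items, txHash)
	m.numTxs--
	m.numBytes -= size
	return true
}

func main() {
	m := &txByHashMapSketch{items: map[string]int64{"aaaa": 128}, numTxs: 1, numBytes: 128}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.removeTx("aaaa") // only one of the two concurrent calls succeeds
		}()
	}
	wg.Wait()

	fmt.Println(m.numTxs, m.numBytes) // prints "0 0", not "-1 -128"
}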

@@ -9,6 +9,7 @@ import (
)

var _ storage.Cacher = (*TxCache)(nil)
var _ txCache = (*TxCache)(nil)

Contributor:

Put and HasOrAdd have some unused parameters.

andreibancioiu (Collaborator, author), May 27, 2020:

Fixed, using _ (these methods of the Cacher interface are not implemented by the TxCache). For disabledCache I've left them as they were, since we plan to remove that struct in a future PR anyway (see the comments above). Is this fine?
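
A sketch of the resulting shape; the Put signature is assumed from the storage.Cacher interface of that era and may differ.

// Put is part of the storage.Cacher interface but is not meaningful for the
// TxCache, so the parameters are blanked and the method is a deliberate no-op.
func (cache *TxCache) Put(_ []byte, _ interface{}) (evicted bool) {
	return false
}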


// Remove delegates to backing cache
func (decorator *txCacheFailsafeDecorator) Remove(key []byte) {
decorator.RemoveTxByHash(key)
Contributor:

unhandled error (at least assign it to _)

Collaborator (author):

The failSafe component was removed in the fix-after-review.

func (txMap *txListBySenderMap) notifyScoreChange(txList *txListForSender, scoreParams senderScoreParams) {
score := txMap.scoreComputer.computeScore(scoreParams)
txList.setLastComputedScore(score)
txMap.backingMap.NotifyScoreChange(txList, score)
}
Contributor:

At line 106, removeSender can act incorrectly if called concurrently for the same sender; e.g. removing the only tx of that sender and removing the sender due to a low tx count will both call removeSender. The removal is done only once, but the counter is decremented twice.

andreibancioiu (Collaborator, author), May 27, 2020:

Nice catch! As above, added a TODO. Is this ok?

@@ -108,6 +108,11 @@ func (txMap *txListBySenderMap) removeSender(sender string) bool {
return false
}

// TODO / bugfix: when we concurrently try to remove the same sender,
// we might end up decrementing the counter twice.
// Possible solution: use an "onItemRemoved" callback in the "backingMap"
Contributor:

To prevent double decrementing, one might use a mutex or an atomic function in order to make the decrement concurrency-safe.
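
A sketch of the "onItemRemoved" callback idea from the TODO, with assumed names: the backing map invokes the callback only when an entry is actually deleted, while still holding its lock, so the senders counter is decremented exactly once (atomically, as suggested above).

type sendersBackingMapSketch struct {
	mutex         sync.Mutex
	items         map[string]*txListForSender
	onItemRemoved func(key string) // e.g. atomically decrements the senders counter
}

// Remove deletes the sender under lock and fires onItemRemoved only for the
// call that actually performed the deletion; concurrent duplicates return false.
func (m *sendersBackingMapSketch) Remove(key string) bool {
	m.mutex.Lock()
	defer m.mutex.Unlock()

	if _, ok := m.items[key]; !ok {
		return false
	}

	delete(m.items, key)
	if m.onItemRemoved != nil {
		m.onItemRemoved(key)
	}

	return true
}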

LucianMincu (Contributor) left a comment:

System tests passed.
