
Cache Transactions #1673

Closed
GalRogozinski opened this issue Nov 26, 2019 · 20 comments

GalRogozinski (Member) commented Nov 26, 2019

Description

Cache only the transactions and transaction metadata

Requirements

  1. Each TransactionViewModel that is written to the DB is first recorded in the cache (see the sketch below).
  2. Before attempting to read a transaction from the DB, it must first be looked up in the cache.
  3. Treat "Fresh" and "Dirty" entries correctly.
  4. Make the cache size 1000 for both transactions and metadata.
     - Eviction policy: FIFO.
     - Evict every 200 transactions.
  5. Use soft/weak references.
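For illustration, here is a minimal Java sketch of how requirements 1, 2 and 4 could fit together (all names here, such as TransactionCache and Db, are hypothetical and not existing IRI classes; the "evict every 200" and soft/weak-reference points are left out for brevity):

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical write-through / read-through wrapper around the persistence layer.
class TransactionCache<K, V> {

    interface Db<DK, DV> {
        DV read(DK key);
        void write(DK key, DV value);
    }

    private static final int MAX_SIZE = 1000;               // requirement 4
    private final Map<K, V> cache = new LinkedHashMap<>();  // insertion order = FIFO
    private final Db<K, V> db;

    TransactionCache(Db<K, V> db) {
        this.db = db;
    }

    // Requirement 1: every write is recorded in the cache first, then in the DB.
    synchronized void save(K key, V value) {
        cache.put(key, value);
        evictIfFull();
        db.write(key, value);
    }

    // Requirement 2: reads check the cache before touching the DB.
    synchronized V load(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        V fromDb = db.read(key);
        if (fromDb != null) {
            cache.put(key, fromDb);
            evictIfFull();
        }
        return fromDb;
    }

    // FIFO eviction (requirement 4): drop the oldest entries once the cache is full.
    // Evicted entries are already in the DB in this write-through sketch, so nothing
    // extra needs to be persisted here.
    private void evictIfFull() {
        while (cache.size() > MAX_SIZE) {
            K oldest = cache.keySet().iterator().next();
            cache.remove(oldest);
        }
    }
}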
ovanwijk (Contributor) commented Nov 27, 2019

Can I ask what the reasoning is for a FIFO policy instead of an LRU policy?

GalRogozinski (Member, Author) commented Nov 27, 2019

The assumption is that IRI will most likely need the most recently received transactions for solidification and tip selection. So FIFO may do better.

hmoog (Contributor) commented Nov 27, 2019

Btw. In goshimmer, we have moved away from a FIFO-based LRU cache and implemented a db layer that gives us more control over things by "manually managing" which resources are still used by the program. This requires writing a little bit more code, because you need to tell the program when you do not need things anymore, but it makes it 100% exact. Resources are only loaded from the db exactly once and they are only persisted to the db exactly once as well (when you really don't need them anymore). And you can never have any kind of race conditions.

The LRU cache started to perform badly in times of high load, because sometimes stuff got evicted from the cache that should have stayed, in favor of a tx that was only loaded for a very brief moment to look up some approvers or something.

Simple use case:

// release resource automatically once the consume callback is done
storage.Load(myKey).Consume(func(item) {
   // do something with the item
})

More complex use case:

// manually manage resources (more code but you have more "control")
dbResult := storage.Load(myKey)
item := dbResult.Get()

// register another consumer so we can tell the program that an additional thread "uses" the item
dbResult.RegisterConsumer()
thread.Start(func() {
    // do something with the item in another thread

    // release the previously registered consumer
    dbResult.Release()
})

// release initial consumer
dbResult.Release()

All calls to storage.Load(myKey) will return exactly the same dbResult entry and therefore work on the exact same underlying cached entry. So you could also write the 2nd example like this:

storage.Load(myKey).Consume(func(item) {
    // do something with the item
    ...

    // do something else with the same item in another thread
    thread.Start(func() {
        storage.Load(myKey).Consume(func(item) {
            // do something else with the item
        })
    })
})

But if you want to go for raw performance then using the already existing object in the thread is a little bit faster because you save the additional cache lookup.

Note: Just as a comparison of how big of a performance impact this made - we went from being able to process around 20000 TPS to now around half a million TPS when it comes to storing and loading entries in the tangle.

I am not sure if this is something that could be applied to IRI as well, but it's extremely efficient to "manually" manage resources instead of hoping for a generic LRU cache to make "smart" decisions at all times about what should go and what should stay in the cache.

In addition, things that are "constantly" required will never get evicted while stuff that is only required for a single lookup will get evicted immediately.

GalRogozinski (Member, Author) commented Nov 27, 2019

@hmoog

in favor of a tx that was only loaded for a very brief moment to look up some approvers or something

This is why I wanted FIFO and not LRU.
Could you expand on why you concluded that LRU is superior?
I suppose you are covering cases where a new subtangle originates from a common root?

Resources are only loaded from the db exactly once and they are only persisted to the db exactly once as well

I understand persisted exactly once. I assume you don't persist them until mutable metadata like solidity has been updated. Of course, if you then wait a long time for them to solidify, they might poison the cache... Perhaps you meant that within a reasonable time period they are persisted only once and read only once (just want to make sure I understood you correctly).

Note: Just as a comparison of how big of a performance impact this made - we went from being able to process around 20000 TPS to now around half a million TPS when it comes to storing and loading entries in the tangle.

wow!

ovanwijk (Contributor) commented Nov 27, 2019

@hmoog with a manual cache policy, do you mean something like: keep all transactions between now and milestone - 15 cached?

This would ensure that what you KNOW will improve performance is in the cache. However, you will still need some cache eviction policy so that nodes can start up again without out-of-memory errors if, for example, the coordinator is down.

I think this could be a good idea, and then next to it have an LRU/FIFO cache for API calls, for example. But that depends on the goal of the caching: whether it is meant to speed up tip selection or to speed up API use.

hmoog (Contributor) commented Nov 27, 2019

We neither use FIFO nor LRU ... and there is also no real "default policy".

Let me maybe explain it in a different way.

An LRU cache and a FIFO cache are essentially the same thing, but they make different assumptions about the underlying usage of the data. You usually define the cache to have a certain fixed size and, once the cache is full, you start evicting stuff:

  1. Things that got loaded first get evicted first (FIFO)
  2. Things that didn't get used for a long time get evicted first (LRU)

In both cases, the cache kind of "automagically" makes the eviction decision for the user, and this is necessary because the cache only has a "limited" size.
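As an illustration only (neither goshimmer nor IRI code), the difference between the two policies comes down to a single flag when building a bounded map in Java:

import java.util.LinkedHashMap;
import java.util.Map;

final class BoundedCaches {

    // accessOrder = false -> entries kept in insertion order -> evicting the eldest is FIFO
    // accessOrder = true  -> entries kept in access order    -> evicting the eldest is LRU
    static <K, V> Map<K, V> create(int maxSize, boolean lru) {
        return new LinkedHashMap<K, V>(16, 0.75f, lru) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after every put(); returning true drops the eldest entry.
                return size() > maxSize;
            }
        };
    }
}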

The cache that we are using in goshimmer now does not make these kinds of decisions for the user automatically; instead, the program "informs" the cache once an item is not used by any running thread anymore - by calling "Release()" on the container of the cached item.

In low-throughput scenarios, this cache will be extremely small, and in high-throughput scenarios it can "adjust" its size to the requirements of the node. The fact that the node "tells" the cache when it does not require an item anymore allows it to clean things up much faster, which turns out to result in even lower memory requirements than when using a vanilla FIFO or LRU cache.

We however still keep stuff around in the cache for a certain time (currently 1 second) before completely evicting it. This allows threads to work on things with a small delay (due to queues) without having to load it again.

This way of "manually" telling the cache when stuff is not needed any longer allows the cache to adjust its size dynamically with nothing ever being evicted prematurely.

So TL;DR: Instead of having to manually call "Release()" on your items, the vanilla LRU / FIFO cache automagically makes decisions for you by assuming a certain usage pattern of the underlying data. But this also creates a situation where the cache is sometimes too big and sometimes too small, and never "really" fits your usage patterns.
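For readers more at home in Java, here is a rough, simplified analogue of the Release()-based approach described above (an illustrative sketch only; the real goshimmer object storage also keeps released items around for about a second, which this sketch omits):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// An entry stays cached while at least one consumer holds it; once the last
// consumer releases it, it is persisted exactly once and evicted.
class RefCountedCache<K, V> {

    static final class Entry<V> {
        final V value;
        final AtomicInteger consumers = new AtomicInteger(1);
        Entry(V value) { this.value = value; }
    }

    private final ConcurrentMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> dbLoader;

    RefCountedCache(Function<K, V> dbLoader) { this.dbLoader = dbLoader; }

    // Every concurrent load of the same key shares the same Entry (and value) instance.
    V load(K key) {
        return entries.compute(key, (k, entry) -> {
            if (entry != null) {
                entry.consumers.incrementAndGet();   // another consumer registered
                return entry;
            }
            return new Entry<>(dbLoader.apply(k));   // read from the DB exactly once
        }).value;
    }

    // The caller signals that it no longer needs the item (the Release() from above).
    void release(K key) {
        entries.compute(key, (k, entry) -> {
            if (entry == null || entry.consumers.decrementAndGet() > 0) {
                return entry;                        // still in use by someone else
            }
            persist(k, entry.value);                 // written back exactly once
            return null;                             // removing the mapping = eviction
        });
    }

    private void persist(K key, V value) {
        // write-back to the persistence layer would go here
    }
}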

ovanwijk (Contributor) commented Nov 27, 2019

@hmoog I understand what you are doing. You are specifically solving cache misses, which is, I think, a good way to go about it. But that is only beneficial if you are constantly hammering your persistence layer, i.e. performance outruns your cache - like continuously calling tip selection. But if the resource is persisted and released (not in use by any thread) and then called again right after, you essentially have a cache miss again. The eviction policy is: not in use, not in cache.

I think doing this is quite a smart way of handling cache misses, but what I am suggesting is that after all processes release the resource, it then gets moved into a more standard FIFO or LRU cache. So when it is called sparsely (at least not simultaneously) you still have the benefits of caching, while solving cache misses (which really are a bottleneck on cache-dependent systems) during high-load times.

GalRogozinski (Member, Author) commented Nov 27, 2019

I think I am understanding you partially @hmoog

I understood how you prevent two copies of the same tx from existing in memory (very important).

What I don't understand is:
Once you are done processing it (no threads or go-routines are using it), you immediately persist it AND evict it from cache?

If I understood correctly, why immediately evict it from cache? Maybe someone will need it?

Anyhow, I think you inspired me a bit here...

Maybe I will change the logic to be what I understood from you, but instead of evicting from the cache upon Release(), I will change the reference from strong to soft. So it may still be served from memory if the GC didn't pick it up.
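A minimal sketch of that idea (hypothetical class, not the actual IRI change; synchronization and the write-back to the DB are omitted): the cache pins the value with a strong reference while it is in use, and release() swaps it for a SoftReference so the GC may reclaim it under memory pressure while it can still be served from memory in the meantime.

import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Entries are pinned with a strong reference while in use and demoted to a
// SoftReference on release(), instead of being evicted outright.
class DemotingCache<K, V> {

    private static final class Slot<V> {
        V strong;               // non-null while some thread still uses the value
        SoftReference<V> soft;  // set once the value has been released
        Slot(V value) { this.strong = value; }
    }

    private final ConcurrentMap<K, Slot<V>> slots = new ConcurrentHashMap<>();

    void put(K key, V value) {
        slots.put(key, new Slot<>(value));
    }

    // May still return the value after release(), as long as the GC has not reclaimed it.
    V get(K key) {
        Slot<V> slot = slots.get(key);
        if (slot == null) {
            return null;
        }
        if (slot.strong != null) {
            return slot.strong;
        }
        V value = slot.soft == null ? null : slot.soft.get();
        if (value == null) {
            slots.remove(key);  // the GC collected it; clean up the empty slot
        }
        return value;
    }

    // Demote instead of evict: drop the strong reference, keep a soft one.
    void release(K key) {
        Slot<V> slot = slots.get(key);
        if (slot != null && slot.strong != null) {
            slot.soft = new SoftReference<>(slot.strong);
            slot.strong = null;
        }
    }
}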

hmoog (Contributor) commented Nov 27, 2019

I edited my comment - I think this makes it a bit more clear - please re-read it :D

hmoog (Contributor) commented Nov 27, 2019

@ovanwijk please note that there is still a cache time and that entities do not get evicted immediately. So it still behaves like an LRU cache, but instead of having to set a predefined size that will never fit 100% (because you will see different loads), it dynamically grows and shrinks as needed.

I however think that an LRU cache is usually better than a vanilla FIFO, because an LRU cache is a FIFO cache where stuff that permanently gets used (like the last milestone tx, for example) doesn't get persisted and loaded over and over again just because it reached the end of the queue. It will simply stay in the cache until it becomes irrelevant.

The ability to not have to set a predefined size makes the caching "perfect" under any load, without wasting space or evicting too often.

ovanwijk (Contributor) commented Nov 28, 2019

@hmoog I see indeed, with the updated comment. And I agree on LRU here.
One thing still to consider is garbage collection times. My experience with the JVM (I have no experience with the Go GC) is that if you don't fix/control the cache size, you think it will massively increase your performance while it fills up your memory. Then, when a high-load event happens and you need that memory again, the GC kicks in and kills your performance for 10 seconds or so. There are different policies of course, but you really don't want your software to rely on GC settings from the start.

It wouldn't be a blocker, but it is something to consider. Just for the sake of argument: cache misses and intermittent GC calls that throttle performance might be preferable to random long freezes.

Especially with the JVM, this behaviour changes the longer the application runs, generally improving over time. But compile-and-test code isn't usually the best way to measure its performance. You really need high load and to hit maximum memory allocation etc. to properly test caching behaviour.

Generally I think your approach is good, but whether it is "perfect" given garbage collection, I am not sure yet ;)

GalRogozinski (Member, Author) commented Nov 28, 2019

@ovanwijk
I think using weak/soft references will solve the problem. Maybe weak refs are better because they will stick around in memory for less time.

GalRogozinski (Member, Author) commented Nov 28, 2019

  1. Each entry should have a dirty flag that signals whether the entry should be persisted.
  2. Each entry should have a fresh flag that signals whether it was ever in the DB.
  3. We should consider race conditions when accessing the cache and the DB.
  4. We will use only weak references for the cache. As long as some thread holds an entry, it will never be released by the GC (see the sketch after this list).

We have made the cache configurable, so it will be easy to change.
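A small sketch of points 1, 2 and 4 (illustrative names only, not the final implementation): a wrapper that carries the two flags and holds the value through a weak reference, so the GC can reclaim the value once no thread references it anymore.

import java.lang.ref.WeakReference;

// dirty -> the entry has changes that still need to be persisted (point 1)
// fresh -> the entry has never been written to the DB yet (point 2)
class CacheEntry<V> {

    private final WeakReference<V> value;   // point 4: weak reference, GC-driven lifetime
    private volatile boolean dirty;
    private volatile boolean fresh;

    CacheEntry(V value, boolean fresh) {
        this.value = new WeakReference<>(value);
        this.fresh = fresh;
        this.dirty = fresh;                 // a brand new entry still has to be persisted
    }

    V get() {
        return value.get();                 // null once the GC has collected the value
    }

    void markDirty()     { dirty = true; }
    void markPersisted() { dirty = false; fresh = false; }

    boolean needsPersisting() { return dirty; }
    boolean isFresh()         { return fresh; }
}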

GalRogozinski removed the L-Groom label Nov 28, 2019
acha-bill self-assigned this Dec 2, 2019
acha-bill (Member) commented Dec 2, 2019

Implementation ideas.

Tangle

All caching operations would live in the Tangle, which manages a CacheManager.
We will also need some annotations, like @Cacheable, that can be applied to TransactionViewModel.

New caches are created when a cache that does not exist yet is requested via CacheManager#getCache.

An example usage could look like this:

/*
 * Loads the model from the DB, or from the cache if the model is cacheable.
 * If using the cache, it first checks the cache and returns the entry if it exists;
 * otherwise it loads from the DB and puts the result in the cache.
 */
load(model, index) {
    if (isCacheable(model)) { // checks if the @Cacheable annotation is on the model
        cache = cacheManager.getCache(model);
        cacheResult = cache.get(index);
        if (cacheResult != null) {
            return cacheResult;
        }
    }

    dbResult = ... // get from DB

    if (isCacheable(model)) {
        cache = cacheManager.getCache(model);
        cache.put(index, dbResult);
    }
    return dbResult;
}

/*
 * Persists the model to the DB and also to the cache if it's cacheable.
 */
save(model, index) {
    if (isCacheable(model)) {
        cache = cacheManager.getCache(model);
        cache.put(index, model);
    }
    // save to DB
}

Cache

Looking at the cache implementation itself, it is instantiated with a CacheConfiguration and uses Guava's MapMaker with strong or weak references, depending on the config.
The default is weak references, e.g.:

ConcurrentMap<K, V> store;

CacheImpl(cacheConfiguration) {
    // https://guava.dev/releases/23.0/api/docs/com/google/common/collect/MapMaker.html
    MapMaker mapMaker = new MapMaker().concurrencyLevel(4);
    if (cacheConfiguration.isWeakReference()) {
        mapMaker = mapMaker.weakKeys();
    }
    store = mapMaker.makeMap();
}

Cache metadata includes the fresh or dirty state of a cache entry and its idleness.
The DEFAULT_TIME_TO_IDLE_SECONDS property on the cache config determines how long it takes for an entry that has not been used to become stale or idle. Idle entries must be evicted and should not be returned by get.

In the cache, we could maintain another map for each entry's metadata. But this means we would have two maps to maintain for every cache operation.

Or we wrap the cache values in a ValueWrapper that stores the metadata alongside the actual value, i.e. Cache<K, ValueWrapper<V>>:

put(key, value) {
    wrapper = store.get(key);
    if (wrapper != null) {
        wrapper.setValue(value);   // update the cached value, otherwise the new value is lost
        wrapper.setState(dirty);
    } else {
        wrapper = new ValueWrapper(value);
        wrapper.setState(fresh);
    }
    wrapper.setLastAccessedAt(currentTime);
    store.put(key, wrapper);
}
private put(key, wrapper) {
    store.put(key, wrapper);
}
get(key) {
    ValueWrapper wrapper = store.get(key);
    if (wrapper == null) {
        return null;
    }
    wrapper.setLastAccessedAt(currentTime);
    put(key, wrapper);
    return wrapper.getValue();
}

Eviction by default is FIFO when the cache is full.
This is not timer-based, so cache#size will be checked before we cache#put.
If full, we evict the oldest DEFAULT_EVICTION_COUNT entries.
Only fresh and dirty entries will be persisted on eviction.
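One caveat: the map built by MapMaker above is not insertion-ordered, so FIFO eviction would need either a separate queue of keys or an ordered backing map. Below is a sketch of the eviction step as a method inside the cache implementation above, under that assumption (persistToDb, getMaxSize and the wrapper accessors are hypothetical helpers):

// Called before every cache#put; not timer based. Assumes `store` preserves
// insertion order (e.g. a LinkedHashMap or a separate key queue), which
// MapMaker's map does not guarantee on its own.
private void evictIfFull() {
    if (store.size() < cacheConfiguration.getMaxSize()) {
        return;
    }
    Iterator<Map.Entry<K, ValueWrapper<V>>> it = store.entrySet().iterator();
    int evicted = 0;
    while (it.hasNext() && evicted < DEFAULT_EVICTION_COUNT) {
        Map.Entry<K, ValueWrapper<V>> oldest = it.next();
        ValueWrapper<V> wrapper = oldest.getValue();
        if (wrapper.isFresh() || wrapper.isDirty()) {
            persistToDb(oldest.getKey(), wrapper.getValue());  // only fresh/dirty entries hit the DB
        }
        it.remove();    // evict regardless of state
        evicted++;
    }
}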

cache#get operations will update the cache's metrics (hit and miss counts).

Finally, on a tangle shutdown, all caches will be flushed to the DB.

GalRogozinski (Member, Author) commented Dec 3, 2019

Thanks!

Tangle

  1. About the annotation: we need to make sure that reflection is not slowing us down too much. After all, performance is why we started this whole thing :-)
    At the moment I am unsure of what I am saying, because maybe the JIT will optimize those calls.
    If, in order to speed things up, we need a class scanner that sets variables via reflection at boot, we should consider doing that in a separate issue.
  2. If we add the annotations, it is better to put them on the Persistable types. We will hopefully get rid of all those ViewModels in the future.

Cache

  1. About idleness and eviction: from what I concluded from @hmoog's comments, it will be best to simply use only weak references. This is how we will imitate his lockResource idea. As long as someone references the object in the cache, it will never be picked up by the GC.

    If you insist on having Time To Idle, we can use soft references that turn weak, but I am not sure why that is needed and it will be harder to implement.

  2. About maintaining another map for metadata: currently the TransactionViewModel that you will cache holds both data and metadata, so I am not sure how you would add an extra map.

  3. "Eviction by default is FIFO when the cache is full."

    @ovanwijk had a comment that using the GC for eviction can slow us down and that we need to control the cache size. However, even if we evict the object from the map completely, it still stays in memory until the GC kicks in, so I don't see a way around it.
    In short, I now think using only weak keys will suffice and there is no need for an eviction policy, unless someone convinces me otherwise :-)

acha-bill (Member) commented Dec 3, 2019

  1. Without reflection, we are left with checking if (model instanceof TransactionViewModel), or whether the model is in a list of cacheable types, before caching.
  2. Yes. We can scrap idleness.
  3. TransactionViewModel holds transaction metadata. We can hold cache metadata either in a ValueWrapper or in another map, as explained above.
GalRogozinski (Member, Author) commented Dec 3, 2019

  1. I think in Tangle we use Persistables, so we can add an isCacheable method to them (sketched below).
    We can later remove it when we do the annotations.
  2. I don't understand, though: you are caching the entire TVM, no? So it will hold the metadata.
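That could be as small as a default method (a sketch only; the real Persistable interface has more methods than shown here, and the Transaction override is just an example):

// Sketch: opt-in caching per Persistable type, removable once annotations are introduced.
public interface Persistable {
    // ... existing serialization methods ...

    default boolean isCacheable() {
        return false;   // not cached unless a type opts in
    }
}

// A cacheable type would simply override it:
public class Transaction implements Persistable {
    @Override
    public boolean isCacheable() {
        return true;
    }
}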
GalRogozinski (Member, Author) commented Jan 30, 2020

We combined the dirty and fresh flags into one.
I think it is best to call this new flag shouldPersist

acha-bill (Member) commented Jan 30, 2020

We combined the dirty and fresh flags into one.
I think it is best to call this new flag shouldPersist

Done in #1729

GalRogozinski (Member, Author) commented Feb 13, 2020

Resolved by #1672
