Cache housekeeper can panic (seen after 50M+ insertions) #267

Closed
fwojtan opened this issue May 26, 2023 · 4 comments · Fixed by #272

fwojtan commented May 26, 2023

I wrote a script to try and expose a different issue I thought might be caused by moka and encountered an unhandled panic. This happened somewhere between 54.8M and 54.9M insertions. I was asynchronously inserting items from a bunch of tasks to try and replicate the behaviour of our real-world system.

The panic I saw was:

thread 'moka-housekeeper-5' panicked at 'attempt to add with overflow', /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/common/frequency_sketch.rs:183:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/core/src/panicking.rs:114:5
   3: moka::common::frequency_sketch::FrequencySketch::index_of
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/common/frequency_sketch.rs:183:9
   4: moka::common::frequency_sketch::FrequencySketch::increment
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/common/frequency_sketch.rs:142:25
   5: moka::sync_base::base_cache::Inner<K,V,S>::apply_reads
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/sync_base/base_cache.rs:1260:35
   6: <moka::sync_base::base_cache::Inner<K,V,S> as moka::common::concurrent::housekeeper::InnerSync>::sync
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/sync_base/base_cache.rs:1115:17
   7: moka::common::concurrent::housekeeper::ThreadPoolHousekeeper<T>::call_sync
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/common/concurrent/housekeeper.rs:352:29
   8: moka::common::concurrent::housekeeper::ThreadPoolHousekeeper<T>::try_schedule_sync::{{closure}}
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/common/concurrent/housekeeper.rs:328:21
   9: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/core/src/ops/function.rs:250:5
  10: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/alloc/src/boxed.rs:1988:9
  11: scheduled_thread_pool::Worker::run_job
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/scheduled-thread-pool-0.2.7/src/lib.rs:426:33
  12: scheduled_thread_pool::Worker::run::{{closure}}
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/scheduled-thread-pool-0.2.7/src/lib.rs:381:61
  13: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/core/src/panic/unwind_safe.rs:271:9
  14: std::panicking::try::do_call
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/std/src/panicking.rs:483:40
  15: __rust_try
  16: std::panicking::try
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/std/src/panicking.rs:447:19
  17: std::panic::catch_unwind
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/std/src/panic.rs:140:14
  18: scheduled_thread_pool::Worker::run
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/scheduled-thread-pool-0.2.7/src/lib.rs:381:21
  19: scheduled_thread_pool::Worker::start::{{closure}}
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/scheduled-thread-pool-0.2.7/src/lib.rs:375:30
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread 'moka-housekeeper-7' panicked at 'attempt to add with overflow', /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/common/frequency_sketch.rs:183:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/core/src/panicking.rs:114:5
   3: moka::common::frequency_sketch::FrequencySketch::index_of
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/common/frequency_sketch.rs:183:9
   4: moka::common::frequency_sketch::FrequencySketch::frequency
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/common/frequency_sketch.rs:122:25
   5: moka::sync_base::base_cache::EntrySizeAndFrequency::add_frequency
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/sync_base/base_cache.rs:728:22
   6: moka::sync_base::base_cache::Inner<K,V,S>::handle_upsert
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/sync_base/base_cache.rs:1358:9
   7: moka::sync_base::base_cache::Inner<K,V,S>::apply_writes
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/sync_base/base_cache.rs:1285:23
   8: <moka::sync_base::base_cache::Inner<K,V,S> as moka::common::concurrent::housekeeper::InnerSync>::sync
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/sync_base/base_cache.rs:1120:17
   9: moka::common::concurrent::housekeeper::ThreadPoolHousekeeper<T>::call_sync
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/common/concurrent/housekeeper.rs:352:29
  10: moka::common::concurrent::housekeeper::ThreadPoolHousekeeper<T>::start_periodical_sync_job::{{closure}}
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/moka-0.10.0/src/common/concurrent/housekeeper.rs:281:45
  11: <alloc::boxed::Box<F,A> as core::ops::function::FnMut<Args>>::call_mut
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/alloc/src/boxed.rs:1995:9
  12: scheduled_thread_pool::Worker::run_job
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/scheduled-thread-pool-0.2.7/src/lib.rs:456:43
  13: scheduled_thread_pool::Worker::run::{{closure}}
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/scheduled-thread-pool-0.2.7/src/lib.rs:381:61
  14: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/core/src/panic/unwind_safe.rs:271:9
  15: std::panicking::try::do_call
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/std/src/panicking.rs:483:40
  16: __rust_try
  17: std::panicking::try
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/std/src/panicking.rs:447:19
  18: std::panic::catch_unwind
             at /rustc/a74ca7e36d5baf0684acd2159264adc64f81786d/library/std/src/panic.rs:140:14
  19: scheduled_thread_pool::Worker::run
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/scheduled-thread-pool-0.2.7/src/lib.rs:381:21
  20: scheduled_thread_pool::Worker::start::{{closure}}
             at /opt/rust/registry/src/git.datcon.co.uk-91de7a0a7768d16b/scheduled-thread-pool-0.2.7/src/lib.rs:375:30
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Script that caused the above panic:

use std::sync::{Arc, Mutex};

#[tokio::main]
async fn main() {
    let evictions_counter = Arc::new(Mutex::new(0));
    let insertions_counter = Arc::new(Mutex::new(0));
    let cache = moka::future::Cache::builder()
        .eviction_listener_with_queued_delivery_mode({
            let evictions_counter = evictions_counter.clone();
            move |_k, _v, _c| {
                *evictions_counter.lock().unwrap() += 1;
            }
        })
        .max_capacity(750000000)
        .weigher(|_, _| 12000)
        .build();

    for j in 0..10 {
        tokio::spawn({
            println!("spawned load task");
            let cache = cache.clone();
            let insertions_counter = insertions_counter.clone();
            let j = j.to_owned();
            async move {
                for i in 1..10000000 {
                    cache
                        .get_with((i, j), async { String::with_capacity(10000) })
                        .await;
                    *insertions_counter.lock().unwrap() += 1;
                }
            }
        });
    }

    println!("inserted: \t, size now: \t, evicted: \t, size+evictions: \t, diff from inserted: \t",);
    loop {
        let insertions = insertions_counter.lock().unwrap().to_owned();
        if insertions % 100000 == 0 {
            let entries = cache.iter().count() as i32;
            let evictions = *evictions_counter.lock().unwrap();
            println!(
                "{},\t\t{},\t\t{},\t\t{},\t\t\t{}",
                insertions,
                entries,
                evictions,
                entries + evictions,
                insertions - entries - evictions
            );
        }
    }
}

I expect this is fairly rare. I've run the script again since and it got past 70M inserts before hitting the issue. I'm not invested in this being fixed so up to you what you want to do with it. Figured it was worth reporting though.

If you're interested I can share more context on the original cache issue I was investigating when I found this (cache capacity slowly reducing as more and more size evictions happen). We dropped our investigation and changed our approach to save time, so I don't think I have enough evidence to raise an issue and confidently point the finger at moka, although at the time we suspected this issue was with this library.

@tatsuya6502 tatsuya6502 added the bug Something isn't working label May 27, 2023
@tatsuya6502 tatsuya6502 self-assigned this May 27, 2023
@tatsuya6502 (Member) commented:

Thank you for reporting the issue!

... attempt to add with overflow', 
... /moka-0.10.0/src/common/frequency_sketch.rs:183:9

moka/blob/v0.10.0/src/common/frequency_sketch.rs#L180-L183

180: fn index_of(&self, hash: u64, depth: u8) -> usize {
181:     let i = depth as usize;
182:     let mut hash = hash.wrapping_add(SEED[i]).wrapping_mul(SEED[i]);
183:     hash += hash >> 32;

I think line 183 should be changed to:

183:     hash = hash.wrapping_add(hash >> 32);

so that a wrapping (modular) addition is performed even in debug builds (where overflow checks are enabled).
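
For reference, a minimal standalone sketch (not moka's code; the hash value below is made up just to force the overflow) showing the difference between + and wrapping_add when overflow checks are enabled:

fn main() {
    let hash: u64 = u64::MAX - 1;

    // wrapping_add always performs modular arithmetic, in both debug and
    // release builds, so it never panics.
    let wrapped = hash.wrapping_add(hash >> 32);
    println!("wrapping_add result: {wrapped}");

    // With overflow checks enabled (the default in debug builds), this
    // addition panics with "attempt to add with overflow".
    let sum = hash + (hash >> 32);
    println!("never reached in a debug build: {sum}");
}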

Let me double check and apply the change.

(Duplicate of #113)

@tatsuya6502 (Member) commented:

I expect this is fairly rare. I've run the script again since and it got past 70M inserts before hitting the issue. I'm not invested in this being fixed so up to you what you want to do with it. Figured it was worth reporting though.

Thanks again for reporting it. I fixed it via #272. I will publish v0.11.1 to crates.io soon with the fix.

Here are some details about the bug:

Moka calculates a u64 hash value for each cache key. The bug in the FrequencySketch::index_of method caused a panic for some hash values; the panic happens on an arithmetic overflow in a + operation.

It is very rare to hit the bug, because:

  • The chance of getting a hash value that triggers the panic is very low.
    • With my simple test that generates random hash values, I could only trigger the panic once in ~1 billion hash values.
  • The panic only happens when the cache is compiled in debug mode.
    • In release mode, arithmetic overflow checks are disabled (by default), so the panic ("attempt to add with overflow") never happens; the CPU silently performs a wrapping addition instead.
    • NOTE: Wrapping addition is the desired behavior for index_of. The fix is to use wrapping_add explicitly instead of +. In release mode, index_of was already working as expected because of this CPU behavior.

The reason your script does not hit the bug on every run is that the hash values for the same keys are different on each run. The following code snippet shows that the hash value for the same key differs between runs:

use std::hash::{BuildHasher, Hash, Hasher};

fn main() {
    let key = (0, 0);

    // Moka cache uses `RandomState` as the default `BuildHasher`. When a cache is
    // created, a `RandomState` instance is created with a random seed.
    //
    // Here, let's create a `RandomState` instance directly, without creating a cache.
    let hash_builder = std::collections::hash_map::RandomState::new();

    // Calculate the hash value of the key, just like a cache does.
    let mut hasher = hash_builder.build_hasher();
    key.hash(&mut hasher);
    let hash = hasher.finish();

    // Print the hash value. Since `RandomState` is seeded by a random value, the
    // hash value is different in each run.
    println!("key: {key:?}, hash: {hash}");
}
$ cargo run
key: (0, 0), hash: 6658444597857261670

$ cargo run
key: (0, 0), hash: 6848010251626297664

$ cargo run
key: (0, 0), hash: 8403121289877648902

Finally, here is the code of the simple test I used to trigger the bug:

hash-finder
// Cargo.toml
// ---------------------------------------------
// [dependencies]
// getrandom = "0.2.9"
// once_cell = "1.17.1"
// rand = { version = "0.8.5", features = ["small_rng"] }
//
// # Enable overflow checks in release builds.
// [profile.release]
// overflow-checks = true
// ---------------------------------------------

use std::num::NonZeroUsize;

use rand::rngs::SmallRng;
use rand::{Rng, SeedableRng};

// Please download this file, which contains the bug:
// https://raw.githubusercontent.com/moka-rs/moka/v0.11.0/src/common/frequency_sketch.rs
//
// and place it into your `src` directory.
//
mod frequency_sketch;

fn main() {
    let num_cores = std::thread::available_parallelism().unwrap_or(NonZeroUsize::new(1).unwrap());
    let num_threads = num_cores.get() / 2;

    // Spawn threads to search for hashes that trigger a panic.
    let handles = (0..num_threads)
        .map(|thread_num| {
            std::thread::Builder::new()
                .name(format!("thread-{}", thread_num))
                .spawn(move || find_hashes(thread_num))
                .unwrap()
        })
        .collect::<Vec<_>>();

    // Wait for the threads to finish and collect their results.
    let results = handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .collect::<Vec<_>>();

    let attempts = results.iter().map(|(attempts, _)| *attempts).sum::<u64>();
    let mut hashes = results
        .iter()
        .map(|(_, hashes)| hashes)
        .flatten()
        .collect::<Vec<_>>();
    hashes.sort_unstable();
    hashes.dedup();

    println!("-------------------------------------");
    println!(
        "number of threads: {}, total attempts: {}, hashes found: {}, chances: once in {:.2} attempts",
        num_threads,
        attempts,
        hashes.len(),
        attempts as f64 / hashes.len() as f64
    );

    for hash in hashes {
        println!("{hash}");
    }
}

fn find_hashes(thread_num: usize) -> (u64, Vec<u64>) {
    const NUM_ATTEMPTS: u64 = 2u64.pow(32);

    // The hashes triggering a panic.
    let mut hashes_found = vec![];
    let mut rng = SmallRng::from_entropy();

    let mut sketch = frequency_sketch::FrequencySketch::default();
    // The capacity does not matter for this bug.
    sketch.ensure_capacity(1024);

    for _ in 0..NUM_ATTEMPTS {
        // Generate a random hash and check if it triggers a panic.
        let hash = rng.gen();
        match std::panic::catch_unwind(|| sketch.frequency(hash)) {
            Ok(_) => (),
            Err(_) => {
                println!("{} - FOUND: hash: {}", thread_num, hash);
                hashes_found.push(hash);
            }
        }
    }

    (NUM_ATTEMPTS, hashes_found)
}

and the output:

## In this program, release mode will panic too because the overflow
## checks are enabled in Cargo.toml.

$ cargo run --release
...

thread 'thread-3' panicked at 'attempt to add with overflow', src/frequency_sketch.rs:185:9
3 - FOUND: hash: 6545676238230191093
thread 'thread-3' panicked at 'attempt to add with overflow', src/frequency_sketch.rs:185:9
3 - FOUND: hash: 15599614373087607955
...

-------------------------------------
number of threads: 4, total attempts: 17179869184, hashes found: 16, chances: once in 1073741824.00 attempts
175235467969503931
350323424204063419
2393337716848854001
2607510736887440732
3614626907659264335
4624000782985340440
4651654716317629850
6545676238230191093
8541891011146845666
8980550827530989485
9619494623239763742
11220422092290893381
13296693810304296595
15599614373087607955
15665336588250767017
15687268636519650785


tatsuya6502 commented May 28, 2023

If you're interested I can share more context on the original cache issue I was investigating when I found this (cache capacity slowly reducing as more and more size evictions happen). We dropped our investigation and changed our approach to save time, so I don't think I have enough evidence to raise an issue and confidently point the finger at moka, although at the time we suspected this issue was with this library.

I hope I can look into your original issue, but it will be very difficult without access trace data. A cache has many moving parts, and at least the following have a major impact on size-based evictions:

  • When was an entry last accessed? (affects its position in the LRU queue)
  • How many times (and when) was an entry accessed? (affects the historic popularity of the entry)
  • What is the weighted size of an entry? (see the rough sketch below)
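
As a rough sketch of the last point, using the numbers from the reproduction script above and assuming max_capacity is treated as the total weighted size when a weigher is set:

fn main() {
    // Numbers taken from the reproduction script earlier in this issue.
    let max_capacity: u64 = 750_000_000; // total weighted-size budget
    let weight_per_entry: u64 = 12_000;  // constant weigher

    // Rough upper bound on how many entries fit before size-based
    // evictions start.
    let max_entries = max_capacity / weight_per_entry;
    println!("approximate entry budget: {max_entries}"); // prints 62500
}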

moka employs the "aggregated victims" strategy described in the following paper:

Maybe the strategy does not match your workload, or there is a bug in the implementation. I'm not sure.

The access trace data would help us understand what happened in your workload. There is no need to include the real key and value of each entry; the hash value of the key, the access time, and the weighted size will be enough. If you can share it, I will be happy to look into it.

#231 (comment)

tatsuya6502 commented:

With the current information, it will be hard to identify the root cause. It will be very helpful if we can get access trace data from the production system. Access trace data can be a CSV file, and each line should have the following information:

  • Duration (milliseconds) since start.
  • Key, or hash of the key.
  • Per-key TTL.
  • If you use a weigher, the weighted size of the entry.

Once we have such a trace, we can replay it to reproduce the issue and investigate the root cause.

#247 (comment)

ben-manes commented:

you might also consider recording a trace (e.g. log a 64-bit hash of the key). Then you can run it against the different simulators that the projects provide and assert behavior, e.g. if any are not honoring the maximum size or misreporting the hit rate, and see what other policies report (like optimal). A trace of a few million events is typically pretty good; you want something many multiples of the cache size to ensure the run captures the long-term workload characteristics.
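
A hypothetical sketch of how such a trace line could be produced (the record_access function and its signature are made up for illustration; they are not part of moka's API):

use std::hash::{BuildHasher, Hash, Hasher};
use std::time::Instant;

// Build one CSV line: elapsed milliseconds since start, 64-bit key hash,
// and the weighted size. Only the hash leaves the system, not the real key.
fn record_access<K: Hash>(
    start: Instant,
    hash_builder: &impl BuildHasher,
    key: &K,
    weight: u32,
) -> String {
    let mut hasher = hash_builder.build_hasher();
    key.hash(&mut hasher);
    let key_hash = hasher.finish();
    format!("{},{},{}", start.elapsed().as_millis(), key_hash, weight)
}

fn main() {
    let start = Instant::now();
    // Reuse the same BuildHasher instance for every key so that equal keys
    // map to equal hashes within one trace.
    let hash_builder = std::collections::hash_map::RandomState::new();
    println!("{}", record_access(start, &hash_builder, &("user", 42_u64), 12_000));
}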


fwojtan commented May 29, 2023

Thanks - that's an interesting and satisfyingly simple explanation!

I'll see if I can get some time to look into the other issue. I never narrowed it down to moka; it could also have been a bug in our code or something interesting about the test scenario. If I find anything concrete enough to report, I'll get back to you.

Thanks for your help
