
Fix use-after-free bug on an internal deque node #15

Merged
tatsuya6502 merged 4 commits into v0.10.x from fix-panics-in-deque on Aug 12, 2023

Conversation

@tatsuya6502 (Member) commented Aug 11, 2023

Fixes #11.

This bug caused memory corruption, leading to the panic reported in #11 and also to segmentation faults.

The root cause

This was a use-after-free bug caused by a dangling NonNull pointer.

  • Each cached entry has an internal struct ValueEntry.
  • ValueEntry has a NonNull pointer to a deque node.
  • When insert is called on an existing entry, its ValueEntry is replaced with a new one.
    • The NonNull pointer is copied to the new ValueEntry, so both the old and the new ValueEntry hold the same NonNull pointer.
  • If the entry is removed from the cache, the deque node is dropped and the NonNull pointer is unset in the new ValueEntry.
    • However, the old ValueEntry can still be referenced from a read log, and its NonNull pointer becomes dangling.
    • If memory is modified through the dangling pointer, other data gets corrupted. (A minimal sketch of this sequence follows the list.)
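To make this sequence concrete, here is a minimal, self-contained sketch of the mechanism. The type and field names (DequeNode, ValueEntry, node) are simplified stand-ins for illustration, not the crate's actual internals:

```rust
use std::ptr::NonNull;

// Hypothetical, simplified stand-ins for the real internal types.
struct DequeNode {
    _key: String,
}

struct ValueEntry {
    // Same shape as in the real struct: an Option<NonNull<..>> into the deque.
    node: Option<NonNull<DequeNode>>,
}

fn main() {
    // insert("A", ..): a deque node is allocated and the entry points at it.
    let node_ptr = NonNull::from(Box::leak(Box::new(DequeNode { _key: "A".into() })));
    let old_entry = ValueEntry { node: Some(node_ptr) };

    // A later insert on the same key replaces the ValueEntry, *copying* the pointer.
    let mut new_entry = ValueEntry { node: old_entry.node };

    // The entry is evicted: the deque node is dropped and the pointer is
    // unset -- but only in the *new* ValueEntry.
    if let Some(p) = new_entry.node.take() {
        unsafe { drop(Box::from_raw(p.as_ptr())) };
    }

    // A pending read log may still hold the old ValueEntry; its pointer now
    // dangles. Dereferencing it would be the use-after-free:
    // unsafe { &(*old_entry.node.unwrap().as_ptr())._key }   // UB!
    assert!(old_entry.node.is_some()); // the stale copy still looks "set"
}
```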

Possible steps to reproduce

This is a timing issue. Here is one possible sequence of steps to reproduce it.

  1. Call insert for a key A.
    • This creates a ValueEntry (1).
  2. Call get on A.
    • This creates a read log pointing to ValueEntry (1).
  3. Update A by calling insert.
    • This creates a new ValueEntry (2) with a copied NonNull pointer.
  4. For some reason, A is evicted from the cache:
    • Possible reasons:
      • A is expired.
      • invalidate is called on A.
      • The cache becomes full and A is selected for eviction.
    • The deque node for A is dropped, and then the NonNull pointer in ValueEntry (2) is set to None. (Its type is Option<NonNull<...>>)
  5. The read log from step 2 is processed.
    • This tries to modify the deque node pointed to by the NonNull pointer in ValueEntry (1).
    • However, the deque node was already dropped at step 4, so modifying it (a use-after-free) will cause one of the following:
      • (a) Corrupts other data at the address the dangling NonNull pointer points to.
      • (b) Causes a segmentation fault.
  6. In case 5-(a), the corrupted data may be accessed later, which can cause the panic reported in #11 (panic: concurrent deque entered unreachable code).

Fixes

  • Before using the NonNull pointer, check that it is still valid.
    • ValueEntry has an EntryInfo struct, which has a flag to tell whether the deque node still exists.
  • When updating an entry via insert, do not copy the NonNull pointer.
    • Moved the NonNull pointer from the ValueEntry struct to the EntryInfo struct.
    • When updating an entry, EntryInfo is not copied but shared between the new and old ValueEntrys via an Arc pointer. (See the sketch after this list.)
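
A minimal sketch of the second fix, using simplified, hypothetical type names (the real structs carry more fields and different synchronization):

```rust
use std::ptr::NonNull;
use std::sync::{Arc, Mutex};

// Hypothetical, simplified stand-ins; the real EntryInfo also carries other
// bookkeeping (e.g. the flag telling whether the deque node still exists).
struct DequeNode;

struct EntryInfo {
    // The deque-node pointer now lives here instead of in ValueEntry.
    node: Mutex<Option<NonNull<DequeNode>>>,
}

struct ValueEntry {
    info: Arc<EntryInfo>,
}

// On update, the EntryInfo is shared rather than copied, so clearing the
// pointer is visible through both the old and the new ValueEntry.
fn update(old: &ValueEntry) -> ValueEntry {
    ValueEntry { info: Arc::clone(&old.info) }
}

// On eviction, drop the deque node and unset the shared pointer.
fn evict(entry: &ValueEntry) {
    if let Some(p) = entry.info.node.lock().unwrap().take() {
        unsafe { drop(Box::from_raw(p.as_ptr())) };
    }
}

fn main() {
    let node_ptr = NonNull::from(Box::leak(Box::new(DequeNode)));
    let old = ValueEntry {
        info: Arc::new(EntryInfo { node: Mutex::new(Some(node_ptr)) }),
    };
    let new = update(&old);
    evict(&new);
    // An old ValueEntry still referenced from a read log now also observes
    // that the node is gone, so no dangling pointer is ever dereferenced.
    assert!(old.info.node.lock().unwrap().is_none());
}
```

Because the Arc-shared EntryInfo is the single owner of the pointer state, every ValueEntry that survives in a read or write log observes the same Some/None value.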

Move internal pointers to deque nodes from `ValueEntry` to `EntryInfo`, so that
the pointers are no longer copied on updating an entry.
@tatsuya6502 tatsuya6502 self-assigned this Aug 11, 2023
@tatsuya6502 tatsuya6502 added the bug Something isn't working label Aug 11, 2023
@tatsuya6502 tatsuya6502 added this to the v0.10.2 milestone Aug 11, 2023
@tatsuya6502 (Member, Author) commented:

I found that the following program can reproduce the panic (and segmentation faults) with v0.10.1.

```rust
// Cargo.toml
//
// [dependencies]
// mini-moka = "0.10.1"
// #mini-moka = { git = "https://github.com/moka-rs/mini-moka", branch = "fix-panics-in-deque" }
//
// rand = { version = "0.8.5", features = ["small_rng"] }
// rand_distr = "0.4.3"

use std::{io::Write, sync::Arc};

use mini_moka::sync::Cache;
use rand::{rngs::SmallRng, seq::SliceRandom, SeedableRng};
use rand_distr::{Distribution, Normal};

const MAX_CAPACITY: u64 = 100;
const NUM_KEYS: usize = 200;

const NUM_THREADS: usize = 8;
const OPERATIONS_PER_THREAD: usize = 10_000_000;

enum Op {
    Get,
    Insert,
    Invalidate,
}

const OP_CHOICE: &[(Op, usize)] = &[(Op::Get, 4), (Op::Insert, 2), (Op::Invalidate, 1)];

fn main() {
    // Create the cache.
    let cache = Arc::new(Cache::builder().max_capacity(MAX_CAPACITY).build());

    let mut seed_rng = rand::thread_rng();

    // Spawn threads that will perform operations on the cache.
    let threads = (0..NUM_THREADS)
        .map(|id| {
            let cache = Arc::clone(&cache);
            let mut rng = SmallRng::from_rng(&mut seed_rng).unwrap();
            let normal_distr = Normal::new((NUM_KEYS / 2) as f64, 20.0).unwrap();

            std::thread::spawn(move || {
                for i in 1..=OPERATIONS_PER_THREAD {
                    let num = normal_distr.sample(&mut rng) as u16;
                    let key = format!("key-{}", num);

                    // Generate an operation.
                    let op = &OP_CHOICE
                        .choose_weighted(&mut rng, |item| item.1)
                        .unwrap()
                        .0;

                    // Run the operation.
                    match op {
                        Op::Get => {
                            cache.get(&key);
                        }
                        Op::Insert => {
                            cache.insert(key, i);
                        }
                        Op::Invalidate => {
                            cache.invalidate(&key);
                        }
                    }

                    // Print progress if we are the first thread.
                    if id == 0 {
                        if i % 10_000 == 0 {
                            print!(".");
                            std::io::stdout().flush().unwrap();
                        }
                        if i % 1_000_000 == 0 {
                            println!();
                        }
                    }
                }

                if id == 0 {
                    println!();
                }
            })
        })
        .collect::<Vec<_>>();

    // Wait for all threads to finish.
    for thread in threads {
        thread.join().unwrap();
    }

    println!("Done.");
}
```
  • It crashes immediately after starting if mini-moka = "0.10.1" is used.
  • It does not crash if this branch is used:
    • mini-moka = { git = "https://github.com/moka-rs/mini-moka", branch = "fix-panics-in-deque" }
  • Environment:
    • Rust 1.71.1, target: aarch64-apple-darwin
    • Apple silicon (M2), 4 × P-cores + 4 × E-cores

@tatsuya6502 (Member, Author) commented:

If I run the above program on v0.10.1 with slight modifications:

  • With get disabled → it runs OK.
  • With invalidate disabled → it runs longer but eventually panics or segfaults. (One way to disable an operation is sketched below.)
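
For illustration, one way to disable an operation in the program above is to give it zero weight in OP_CHOICE (this is my sketch; the exact modification used is not shown):

```rust
// Disable `get` by giving it zero weight; `choose_weighted` never selects a
// zero-weighted item as long as the total weight stays positive.
const OP_CHOICE: &[(Op, usize)] = &[(Op::Get, 0), (Op::Insert, 2), (Op::Invalidate, 1)];
```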

Make `apply_reads` check whether the deque node is still valid before moving it
in the deque.
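
Roughly, the kind of guard this commit describes might look like the following minimal sketch, with hypothetical names (node_alive, ReadRecord) and the actual deque manipulation elided:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Hypothetical names; not the crate's actual API.
struct EntryInfo {
    // Cleared when the entry's deque node is dropped.
    node_alive: AtomicBool,
}

struct ReadRecord {
    info: Arc<EntryInfo>,
}

fn apply_reads(read_log: &[ReadRecord]) {
    for record in read_log {
        // Only touch the deque node if it is still known to be alive;
        // otherwise the pointer held by an old ValueEntry may be dangling.
        if record.info.node_alive.load(Ordering::Acquire) {
            // ... move the node toward the MRU end of the deque ...
        }
    }
}

fn main() {
    let info = Arc::new(EntryInfo { node_alive: AtomicBool::new(true) });
    let log = vec![ReadRecord { info: Arc::clone(&info) }];

    // Simulate eviction: the deque node is dropped and the flag is cleared.
    info.node_alive.store(false, Ordering::Release);

    apply_reads(&log); // the stale read record is now skipped
}
```
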
@tatsuya6502 (Member, Author) commented:

I found the root cause and fixed it in 84c160c. Now I know why the previous commit ef90224 also fixed the issue. I will keep both commits for extra safety.

Later, I will update the root cause section of this PR's description.

@tatsuya6502 (Member, Author) commented:

Later, I will update the root cause section of this PR's description.

Updated the (entire) description.

@tatsuya6502 tatsuya6502 changed the title from "Try to fix the panic in concurrent Deques" to "Fix use-after-free on an internal deque node" on Aug 12, 2023
@tatsuya6502 tatsuya6502 changed the title from "Fix use-after-free on an internal deque node" to "Fix use-after-free bug on an internal deque node" on Aug 12, 2023
@tatsuya6502 tatsuya6502 marked this pull request as ready for review August 12, 2023 09:36
@tatsuya6502 (Member, Author) left a review comment:

Merging into v0.10.x branch.

@tatsuya6502 tatsuya6502 merged commit e95c168 into v0.10.x Aug 12, 2023
28 checks passed
@tatsuya6502 tatsuya6502 deleted the fix-panics-in-deque branch August 12, 2023 09:51