Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(heavier_once_cell): assertion failure can be hit #6652

Merged
merged 5 commits into from
Feb 8, 2024

Conversation

koivunej
Copy link
Contributor

@koivunej koivunej commented Feb 6, 2024

@problame noticed that the tokio::sync::AcquireError branch assertion can be hit like in the first commit. We haven't seen this yet in production, but I'd prefer not to see it there. There take_and_deinit is being used, but this race must be quite timing sensitive.

Copy link

github-actions bot commented Feb 6, 2024

2394 tests run: 2277 passed, 0 failed, 117 skipped (full report)


Code coverage (full report)

  • functions: 54.3% (11325 of 20843 functions)
  • lines: 81.3% (63541 of 78122 lines)

The comment gets automatically updated with the latest test results
f5b0e72 at 2024-02-07T07:35:08.193Z :recycle:

@koivunej
Copy link
Contributor Author

koivunej commented Feb 7, 2024

Earlier you mentioned that the loser making any forward progress requires that there cannot be a forever loop which does get*_or_init + get_mut + take_and_deinit -- I agree. What would save us from that is to "take the queue" of the old semaphore and requeue them in the same order to the new semaphore in take_and_deinit.

Solving that would mean a custom Semaphore and ll_sem in tokio speak. Especially considering the new RWLock, it's really nice to be able to close the semaphore. Alternatively we could use the Inner::value as a state variable more (is_some()) and add all permits to the current semaphore (u32::MAX or so) as "closing".. Need to consider this..

EDIT: tried to implement this in #6657.

Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the init_take_deinit_scenario test cases are worth it. I would have preferred a failpoint-style approach. Obviously not with failpoint's global state, but, a cfg(test) field in the heavier_once_cell to configure the failpoint.


Why did we have to switch from acquire_owned() to acquire() in this PR? Would help the clarity of this change if we could avoid that.


Regarding the core change, I think it prevents the bug we found from occuring. 95% confident it doesn't introduce new issues.

Commit message should point out that the essence of this change is the Arc::ptr_eq.

libs/utils/src/sync/heavier_once_cell.rs Show resolved Hide resolved
libs/utils/src/sync/heavier_once_cell.rs Show resolved Hide resolved
@koivunej
Copy link
Contributor Author

koivunej commented Feb 7, 2024

Why did we have to switch from acquire_owned() to acquire() in this PR? Would help the clarity of this change if we could avoid that.

So that the Arc<Semaphore> is not consumed and we can do the Arc::ptr_eq.

Commit message should point out that the essence of this change is the Arc::ptr_eq.

I don't think the Arc::ptr_eq is necessary here -- could just always loop when closed.

@koivunej koivunej merged commit 9a31311 into main Feb 8, 2024
47 checks passed
@koivunej koivunej deleted the heavier_once_cell_assertion_failure branch February 8, 2024 20:40
problame added a commit that referenced this pull request Feb 9, 2024
problame added a commit that referenced this pull request Feb 9, 2024
This PR reverts

- #6589
- #6652

because there's a performance regression that's particularly visible at
high layer counts.

Most likely it's because the switch to RwLock inflates the 

```
    inner: heavier_once_cell::OnceCell<ResidentOrWantedEvicted>,
```

size from 48 to 88 bytes, which, by itself is almost a doubling of the
cache footprint, and probably the fact that it's now larger than a cache
line also doesn't help.

See this chat on the Neon discord for more context:

https://discord.com/channels/1176467419317940276/1204714372295958548/1205541184634617906

I'm reverting 6652 as well because it might also have perf implications,
and we're getting close to the next release. We should re-do its changes
after the next release, though.

cc @koivunej 
cc @ivaxer
koivunej added a commit that referenced this pull request Feb 12, 2024
@problame noticed that the `tokio::sync::AcquireError` branch assertion
can be hit like in the added test. We haven't seen this yet in
production, but I'd prefer not to see it there. There `take_and_deinit`
is being used, but this race must be quite timing sensitive.

Rework of earlier: #6652.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants