Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runtime events consumer: don't parse dropped events #12062

Merged

Conversation

TheLortex
Copy link
Contributor

In the runtime events consumer library, there's a code path that handles the case when an event content is read, but in the meantime was overwrote by the producer. In that case, it calls the the lost_events callback if it exists, but then continues parsing the event with whatever data overwritten in the ring buffer, yielding incorrect values.

This problem appeared when I used custom runtime events at a very high rate (700k/s). Indeed, at a low rate this code path is rarely taken.

The PR includes a reproduction test that spawns two domains, one emitting custom events at a high rate and one consuming them and checking that the associated value is expected (0). If it's different, it means that the event payload was overwrote with a new event block (the expected zero value would then contain a header or timestamp value parsed as an integer).

The proposed fix is the following: continue (jump over event parsing) for all code paths as long as the event dropped condition is met:

 /* Check the message we've read hasn't been overwritten by the writer */
 if (ring_head > cursor->current_positions[domain_num]) {
     ...
+    continue;
 }

@Engil @sadiqj

Copy link
Contributor

@sadiqj sadiqj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good @TheLortex , thanks for identifying and a fix.

Only one issue is I can't get the test to fail in the case where the fix isn't applied. Does it fail reliably for you?

@TheLortex
Copy link
Contributor Author

Does it fail reliably for you?

Yes it does. Maybe try increasing the numbers in the two for loops ?

@TheLortex TheLortex force-pushed the runtime-events-consumer-fix-lost-events branch from eccb7f9 to 99abe06 Compare March 8, 2023 10:32
Copy link
Contributor

@abbysmal abbysmal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at the fix and I believe it is correct.

@sadiqj I can reliably reproduce it on my side (you can check on bremusa as well, I've seen it fail there.), so I think this may not be a blocker?

@sadiqj
Copy link
Contributor

sadiqj commented Mar 16, 2023

so I think this may not be a blocker?

Agreed, thanks for taking a look @Engil . Maybe it's just a quirk of my local setup. I think we're good to go.

@kayceesrk kayceesrk merged commit e72c139 into ocaml:trunk Mar 18, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants