Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix division by zero crash when doing metering #529

Conversation

amarsri28
Copy link
Contributor

this one resolve #376 . we can test this patch with Metering stress test . Memory corruption(caused by stl container append) was happening as internal container address was referred and used.
commits were reverted to resolved issue (original code seems working fine)

we dont need workaround #385

@ccascone ccascone changed the title Bess metering division by zero crash #376 Fix division by zero crash when doing metering Feb 28, 2022
@pudelkoM
Copy link
Member

This is just a revert of both the profile optimization (d2c0793) and the following workaround, right? Seems safe to me, the old code worked before.

Do we have a test case showing the crash right now, and that it's fixed after this change?

Copy link
Contributor

@ccascone ccascone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the change. I'm temporarily leaving a -1 since we have a release happening this week and we want to avoid last-minute disruptions. We can merge next week.

@amarsri28
Copy link
Contributor Author

amarsri28 commented Mar 2, 2022

@ccascone I am not sure the purpose and process of your Release, but metering functionality is broken (ue, app level etc) so if the purpose is to give bug free release along with Qos-Metering feature(latest code set), can consider this patch for release testing.
Furthermore push it for production testing asap.
@pudelkoM use #528 for testing,
after 20 secs (time defined by #528 ) we can observe absurd metering with 13mpps in below figure where as 1 mpps is the correct rate which ll come for first 20secs using this test case(tested with latest cloned code).

image

@amarsri28
Copy link
Contributor Author

below is the correct metering rate ,tested by #528 , come for 20 secs as defined in testcase with latest cloned code

image

@pudelkoM
Copy link
Member

pudelkoM commented Mar 2, 2022

From our slack discussion:

My understanding is as follows: Sai’s patch added an indirection on the meter profile (rates + more config) by coalescing identical meter configs and storing them in a hash map (params_map_). Before, each meter entry had its profile inline, now it’s a pointer to the shared instance. DPDK hash maps are not not stable on inserts:

As mentioned above, Cuckoo hash implementation pushes elements out of their bucket, if there is a new entry to be added which primary location coincides with their current bucket

https://doc.dpdk.org/guides/prog_guide/hash_lib.html

This means pointers to previously looked up data become invalidated on inserts and then point to random data (could be zero or just garbage). This causes the fluctuating rates, as the actual profile (with cir, pir, …) has been moved into a different bucket, but the pointer used in the data path (val[j]->p) was not updated.

Carmelo's workaround relied on the coincidence that that memory seemed to get cleared, hence showed 0 rates.

Abseil’s explanation of pointer stability in hash maps: https://abseil.io/docs/cpp/guides/container#fn:pointer-stability

@ccascone
Copy link
Contributor

ccascone commented Mar 2, 2022

@amarsri28 apologies, I originally misunderstood the issue and fix. After discussing with @pudelkoM (thanks for the explanation) we agree that this should be merged before the release.

BTW, I was talking about the Aether release. We are currently in the process of validating the Aether 2.0 release (due very soon) using automated system-level integration tests:
https://docs.aetherproject.org/master/testing/about_system_tests.html

Here the "system" is Aether, which includes many components other than the UPF and core. We have tests that verify the integration with the fabric (SD-Fabric), the management GUI/API (ROC), etc. We do this for both 4G and 5G core. So lots of moving pieces. Debugging of failed tests is mostly manual for now and it currently takes a lot of time. For this reason, as we approach the release, we tend to be careful in merging new changes to avoid introducing unnecessary disruption that might distract us from other existing bugs.

I hope this gives you some perspective on our processes.

@ccascone ccascone merged commit 9cb736e into omec-project:master Mar 2, 2022
@amarsri28
Copy link
Contributor Author

just stating facts as it caused confusion over dpdk hash table reliability to our team members.
Global reference to c++ container memory were taken and as container expand , it may reallocate memory etc as a part of reallocation process, which has caused issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Integer division by zero crash in bessd
3 participants