Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix coverity erro (CID 349552) double lock #6970

Merged
merged 11 commits into from
Oct 1, 2019

Conversation

thiagoftsm
Copy link
Contributor

Summary

Fixes #6969

We had a deadlock inside the database after to merge template to dimension, this commit fixes this error.

Component Name

Health

Additional Information

We had a double lock before, this commit fix this
@squash-labs
Copy link

squash-labs bot commented Sep 28, 2019

Manage this branch in Squash

Test this branch here: https://thiagoftsmdim-template-fix-neb5x.squash.io

@mfundul
Copy link
Contributor

mfundul commented Sep 28, 2019

I don't think this is enough, since you are only holding the host lock and you allow the dimension to be deleted during:
rrdcalc_link_to_rrddim(rd, st, host);
by a different thread.

This commit fix the order process
@thiagoftsm
Copy link
Contributor Author

I push another commit after to read your newest text in the ISSUE related to this PR.

@mfundul
Copy link
Contributor

mfundul commented Sep 28, 2019

The only question remaining to be answered is whether a caller of rrddim_add_custom() already holds the host lock, which would result in a deadlock. If that is not the case, then this should suffice.

@@ -204,7 +208,6 @@ RRDDIM *rrddim_add_custom(RRDSET *st, const char *id, const char *name, collecte
return rd;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to handle error cases such as this one. You could add something like:

if (host_locked)
    rrdhost_unlock(host);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was verifying Netdata code here, the function rrdhost_wrlock and rrdhost_unlock are definitions to other internal functions, netdata_rwlock_wrlock_debug and netdata_rwlock_unlock_debug, these other two functions call pthread functions to try lock and try unlock, these functions already have error messages inside it.
I tried to use pthread_rwlock_trywrlock to check errors and I received a lot RW_LOCK: failed to obtain write lock (code 35) and the alarms where not created, os I returned for the current code we have here, because it is working fine.

@thiagoftsm
Copy link
Contributor Author

thiagoftsm commented Sep 28, 2019

The only question remaining to be answered is whether a caller of rrddim_add_custom() already holds the host lock, which would result in a deadlock. If that is not the case, then this should suffice.

Without to use function rrdhost_lock() I have the following error message:

2019-09-28 16:48:09: netdata ERROR : PLUGIN[proc] : RW_LOCK: failed to obtain write lock (code 35)

and after this Netdata exit with failure, so it is necessary to call this function.

I am returning for the first solution, because the others are generating:
@thiagoftsm
Copy link
Contributor Author

I tested all options that you wrote in the original issue and here, and I returned for the first solution that you gave us, because it is the unique that does not give us PLUGINSD[apps] : RW_LOCK: failed to obtain write lock (code 35) .
For sure we do not need to lock the chart, because the last part of the code inside the function only create alarms, so we are only reading chart names, the necessity to lock host is related to the alarms associated to it.

@mfundul
Copy link
Contributor

mfundul commented Sep 28, 2019

I tested all options that you wrote in the original issue and here, and I returned for the first solution that you gave us, because it is the unique that does not give us PLUGINSD[apps] : RW_LOCK: failed to obtain write lock (code 35) .
For sure we do not need to lock the chart, because the last part of the code inside the function only create alarms, so we are only reading chart names, the necessity to lock host is related to the alarms associated to it.

This will not work because the code is referencing the chart and dimension by pointers st and rd respectively from:

 rrdcalc_link_to_rrddim(rd, st, host);

and if for example a thread is deleting a dimension (rd) then netdata will crash.

@mfundul
Copy link
Contributor

mfundul commented Sep 28, 2019

That is a theoretical problem of course, since it's hard to think of a practical case for this, since the dimension has just been created. But still, it's a potential bug, and if there's no chance for it now, there may be in the future.

This solution try to lock the host before to move in front
@thiagoftsm
Copy link
Contributor Author

That is a theoretical problem of course, since it's hard to think of a practical case for this, since the dimension has just been created. But still, it's a potential bug, and if there's no chance for it now, there may be in the future.

You are correct, I thought the situation again and I found a way to avoid the error reported and do not raise errors with our current code in this last commit, please, verify this solution now, I think we can accept 5 tries to lock the host to create the alarms, mainly considering that this did not happen in my tests and probably during the review, because nobody reported errors related to this. But, I have a doubt related to the time I am waiting for to try lock again.

hostlocked = pthread_rwlock_trywrlock(&host->rrdhost_rwlock);
if(!hostlocked)
{
rrdcalc_link_to_rrddim(rd, st, host);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you are reusing the rd dimension pointer without holding the lock for chart st. If rd is deleted in the meantime it would crash. Moreover, even if you retook the chart lock, still you would be reusing the rd pointer which could have been freed during the period when you did not actually hold the chart lock.

We need to think this through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are creating rd in this function, so we cannot delete it at this point. What we can do is to move back rrdset_unlock this will avoid we delete the chart. Considering that the write lock for chart is different of write lock for host, I do not think the problem was there. Right now I really believe that the error reported from Coverity was related I try to lock something that was possibly locked. I moved the function back and I did not notice any error reported.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, since you moved the lock again rd is safe again but you still keep the wrong lock order only now you're try-locking instead of locking. So, in the previous cases of deadlock, now you're delaying and failing, which is probably why coverity does not write this down. But still, there's probably cases where we delay and fail during try-locking. Is that a problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not detect any error while Netdata was trying to lock to create the alarm, but I had when I tried to do some changes in the previous commits.
About your question, I thought about it and I decided to try 5 times and finally raise an error. What can happen is we do not have an alarm created, but I used Netdata simple pattern to create 11 dimensions in a chart and I did not have any error when I tried to lock, so I think this is not a problem, at least it was not a problem with the charts I tried.
After to read your message I decided to repeat a test that I did in your PR, I changed the example.python to have 160 dimensions and I created an alarm using foreach to create an alarm for each dimensions and again I did not have any error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure you fix the grammar of the error message:

After to try creating an alarm for dimension %s on chart %s 5 times without success, I am moving in front

to something like:

Failed to create an alarm for dimension %s of chart %s 5 times. Skipping alarm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I executed the changes asked and Netdata continues working without raise errors and with all alarms created.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please reduce the number of lines according to our latest code style specifications

@thiagoftsm clion can autoformat code, just use it, see code section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check this, thanks @ilyam8!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm waiting for the code style change to approve this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last commit fixes the format. Thanks!

To avoid the chart to be deleted while we are linking the alarm
I am moving bak the chart lock
This commit fixes the grammar of an error message
Bring the defined pattern to the code and use netdata_rwlock_trywrlock
This commit fixes a format missing
@thiagoftsm thiagoftsm merged commit c48d52e into netdata:master Oct 1, 2019
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request Oct 9, 2019
* dim_template_fix: Fix lock

We had a double lock before, this commit fix this

* dim_template_fix: Fix order

This commit fix the order process

* dim_template_fix: Return

I am returning for the first solution, because the others are generating:

* dim_template_fix: Try to lock

This solution try to lock the host before to move in front

* dim_template_fix: Move chart lock

To avoid the chart to be deleted while we are linking the alarm
I am moving bak the chart lock

* dim_template_fix: Fix grammar

This commit fixes the grammar of an error message

* dim_template_fix: bring pattern

Bring the defined pattern to the code and use netdata_rwlock_trywrlock

* dim_template_fix Fix format

This commit fixes a format missing
@thiagoftsm thiagoftsm deleted the dim_template_fix branch November 13, 2019 12:03
jackyhuang85 pushed a commit to jackyhuang85/netdata that referenced this pull request Jan 1, 2020
* dim_template_fix: Fix lock

We had a double lock before, this commit fix this

* dim_template_fix: Fix order

This commit fix the order process

* dim_template_fix: Return

I am returning for the first solution, because the others are generating:

* dim_template_fix: Try to lock

This solution try to lock the host before to move in front

* dim_template_fix: Move chart lock

To avoid the chart to be deleted while we are linking the alarm
I am moving bak the chart lock

* dim_template_fix: Fix grammar

This commit fixes the grammar of an error message

* dim_template_fix: bring pattern

Bring the defined pattern to the code and use netdata_rwlock_trywrlock

* dim_template_fix Fix format

This commit fixes a format missing
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request May 21, 2020
* dim_template_fix: Fix lock

We had a double lock before, this commit fix this

* dim_template_fix: Fix order

This commit fix the order process

* dim_template_fix: Return

I am returning for the first solution, because the others are generating:

* dim_template_fix: Try to lock

This solution try to lock the host before to move in front

* dim_template_fix: Move chart lock

To avoid the chart to be deleted while we are linking the alarm
I am moving bak the chart lock

* dim_template_fix: Fix grammar

This commit fixes the grammar of an error message

* dim_template_fix: bring pattern

Bring the defined pattern to the code and use netdata_rwlock_trywrlock

* dim_template_fix Fix format

This commit fixes a format missing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock defect introduced in database
4 participants