Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need more graceful handling of duplicate machine GUIDs #5488

Closed
cakrit opened this issue Feb 24, 2019 · 8 comments
Closed

Need more graceful handling of duplicate machine GUIDs #5488

cakrit opened this issue Feb 24, 2019 · 8 comments

Comments

@cakrit
Copy link
Contributor

cakrit commented Feb 24, 2019

Bug report summary

We spent a lot of time trying to identify the root cause of some problems in a master slave setup. The root cause ended up being that one of the slaves was created by cloning the master, taking the same machine GUID with it. We should have at least a detection of such cases, so we can warn users about it.

Steps To Reproduce

Master and slave with the same GUID.
The master hostname configured in netdata.conf is overwritten by the slave's hostname and we have issues with the charts as well.

Expected behavior

Report an error, or have a self-cleaning behavior (e.g. regenerate a machine GUID when a duplicate is reported).

@paulkatsoulakis
Copy link
Contributor

paulkatsoulakis commented Feb 24, 2019

It seems that the current implementation is informing the user with a warning, while at the same time replaces the host information with the one that just arrived.

Offending line:

if(strcmp(host->hostname, hostname) != 0) {

From a brief research, i could see the following possible solutions:

  • The change with the least customer impact: Always calculate the GUID upon start-up instead of attempting to find it on file first. Offending method:

    char *registry_get_this_machine_guid(void) {

    To do this though, we will have to rethink the GUID calculation, so that we can statically define it on each machine.
    Then we need to make the warning in the following line a hard error that would prompt the user of the situation and stop the start up (in case this happens again on some unpredictable scenario)
    if(strcmp(host->hostname, hostname) != 0) {

  • If GUID uniqueness can't be approached in a way that the identifier will be regenerated at run-time instead of being saved on the file, then we could simply fail the start-up when this happening so that at least we avoid raising that conflict. The downturn here is that we add the effort to our customer to take action and fix the conflict.

  • A much more risky approach to mitigate this could be the auto-fixing of GUID when it arrives on the master host. Conflicting GUIDs could be specially treated on master by adding an extra identifier on the GUID string when saving it locally. For example we could append the md5 of the IP, so that it can be reconstructed every time is seen and not conflict with the master. It might sound easy change, but it might require quite a few adjustments to do it right (i haven't gone through it thoroughly)

@cakrit
Copy link
Contributor Author

cakrit commented Feb 25, 2019

The change with the least customer impact: Always calculate the GUID upon start-up instead of attempting to find it on file first.

The machine guid is supposed to be a rather permanent identification of a host. The registry depends on that more than it does on hostnames and urls. So we can't be changing it all the time.

The other two options seem better. If we already identify the situation (you said you saw a warning), then we could auto-correct it on the master. We don't want to change the GUID format, but we could regenerate it. There is a question though on whether by that time we will already have some corrupted data. So it may actually be necessary for the master to refuse to process any of the slave's data, until the duplicate GUID situation is resolved, either manually or automatically.

@ktsaou
Copy link
Member

ktsaou commented Feb 25, 2019

Fixed in #5497

@cakrit
Copy link
Contributor Author

cakrit commented Feb 25, 2019

@paulkatsoulakis please test that PR and we'll merge.

@paulkatsoulakis
Copy link
Contributor

The fix does not seem to get us all the way there.
We are still allowing the execution of rrdhost_find_or_create() that modifies our configuration, thus resulting on the UI showing the wrong host as the default host that should be the master.

host = rrdhost_find_or_create(

If i get the change right, we do block the eventual data corruption, but we still need to make sure we avoid modifying the master host details. So.. probably this check should go at the very beginning to completely avoid any further processing ?

@cakrit
Copy link
Contributor Author

cakrit commented Feb 25, 2019

Ok, can you do a different PR based on #5497 to take care of the UI as well?

@paulkatsoulakis
Copy link
Contributor

Good morning, sure i 'm on it

@paulkatsoulakis
Copy link
Contributor

I think we can resolve this now (i can't resolve)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants