
Unable to restart mackerel-agent with slow cloud meta-data service #412

hanazuki opened this issue Sep 7, 2017 · 3 comments

@hanazuki
Contributor

hanazuki commented Sep 7, 2017

A recent update introduced a Custom Identifier (e.g. AWS instance ID) check (#405). This check ensures that the Custom Identifier of the current instance matches the one previously stored on the mackerel.io server for the host ID. In our setup, this check prevents mackerel-agent from starting up.

The Custom Identifier of the current instance is fetched from the cloud meta-data service (i.e. http://169.254.169.254 in the case of AWS). The meta-data service is expected to return a response within 100 ms:

```go
var timeout = 100 * time.Millisecond
```

When the meta-data service is unable to respond within 100ms, an empty string is used as the Custom Identifier of the current instance, which does not match the one successfully stored on the server in the previous agent run.

```go
if result.CustomIdentifier != "" && result.CustomIdentifier != customIdentifier {
	if fsStorage, ok := conf.HostIDStorage.(*config.FileSystemHostIDStorage); ok {
		return nil, fmt.Errorf("custom identifiers mismatch: this host = \"%s\", the host whose id is \"%s\" on mackerel.io = \"%s\" (File \"%s\" may be copied from another host. Try deleting it and restarting agent)", customIdentifier, hostID, result.CustomIdentifier, fsStorage.HostIDFile())
	}
	return nil, fmt.Errorf("custom identifiers mismatch: this host = \"%s\", the host whose id is \"%s\" on mackerel.io = \"%s\" (Host ID file may be copied from another host. Try deleting it and restarting agent)", customIdentifier, hostID, result.CustomIdentifier)
}
```

(Here `customIdentifier` is `""` and `result.CustomIdentifier` is the previously stored instance ID.)

We use OpenStack, which provides an AWS-compatible meta-data service at 169.254.169.254, but in our setup it sometimes takes a bit longer than 100 ms to respond. For the reason described above, mackerel-agent often fails to restart.

I have a few ideas for dealing with this problem:

  • Add an option to extend the HTTP timeout for the meta-data service,
  • Add an option to disable cloud meta-data retrieval, or
  • Skip the check when the Custom Identifier cannot be fetched for some reason.

Which one is preferred? I can provide a patch for whichever solution is chosen.

@hanazuki hanazuki changed the title Unable to start mackerel-agent with slow cloud meta-data service Unable to restart mackerel-agent with slow cloud meta-data service Sep 7, 2017
@itchyny
Contributor

itchyny commented Sep 7, 2017

Thank you. I think we can simply extend the default timeout (adding an option for this would not be a good idea). IMO a 5-second timeout is acceptable. What do you think? @mechairoi @Songmu

@hanazuki
Contributor Author

hanazuki commented Sep 7, 2017

I had missed ~~#401~~ #398. My description above seems to be wrong.

Before ~~#401~~ #398, VM instances on our OpenStack passed the isEC2 check, since the AWS-compatible meta-data service returns ami-id. After that change, they don't, because the kernel does not provide /sys/hypervisor/uuid (we use qemu-kvm instead of Xen), and thus the Custom Identifier is not fetched from the meta-data service.

Update: refer to the correct PR

@Songmu
Contributor

Songmu commented Jan 11, 2019

The timeout duration is now 3 seconds. There is also a way to explicitly specify `cloud_platform = "ec2"` in the configuration file.
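For reference, that setting goes in the agent's TOML configuration file (the path below is the usual default and may differ per installation):

```toml
# /etc/mackerel-agent/mackerel-agent.conf
# Skip cloud-platform auto-detection and treat this host as EC2(-compatible).
cloud_platform = "ec2"
```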

Closing this issue as it has already been resolved.

@Songmu Songmu closed this as completed Jan 11, 2019