Add Azure Monitor output plugin #4089

Merged: 29 commits, Sep 5, 2018

gunnaraasen
Member

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

This is a new output plugin for Azure Monitor. I will be adding more unit tests soon.


```
# Configuration for sending aggregate metrics to Azure Monitor
[[outputs.azuremonitor]]
```
Contributor

Not required, but consider renaming this to azure_monitor.

## specified, the plugin will attempt to retrieve the resource ID
## of the VM via the instance metadata service (optional if running
## on an Azure VM with MSI)
#resourceId = "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Compute/virtualMachines/<vm-name>"
Contributor

Use snake_case for all options.
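
As a rough illustration of the snake_case convention, the sketch below declares two options with `toml` struct tags and reads the tags back with `reflect`. The `exampleConfig` type and the `tomlName` helper are hypothetical; only the `resource_id` and `region` option names appear elsewhere in this PR.

```go
package main

import (
	"fmt"
	"reflect"
)

// exampleConfig is a hypothetical stand-in for the plugin's config struct.
// Go field names stay in CamelCase; the TOML option names use snake_case.
type exampleConfig struct {
	ResourceID string `toml:"resource_id"`
	Region     string `toml:"region"`
}

// tomlName returns the TOML option name declared for the given field,
// or an empty string if the field does not exist.
func tomlName(field string) string {
	f, ok := reflect.TypeOf(exampleConfig{}).FieldByName(field)
	if !ok {
		return ""
	}
	return f.Tag.Get("toml")
}

func main() {
	fmt.Println(tomlName("ResourceID"))
	fmt.Println(tomlName("Region"))
}
```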

}

if resp.StatusCode >= 300 || resp.StatusCode < 200 {
return nil, fmt.Errorf("Post Error. HTTP response code:%d message:%s, content: %s",
Contributor

Should this be a "GET Error"?

}

if resp.StatusCode >= 300 || resp.StatusCode < 200 {
return nil, fmt.Errorf("Post Error. HTTP response code:%d message:%s reply:\n%s",
Contributor

This should say "GET Error". Also convert it to a single-line error string, since these end up in the logging output.
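
One way to read this suggestion, as a minimal sketch: build the error from the status and reply, and collapse any newlines in the reply so the message stays on one line. The helper name `metadataStatusError` and the exact message wording are assumptions, not code from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// metadataStatusError is a hypothetical helper. It reports a failed GET
// against the instance metadata service as a single line, because the
// error text ends up in the log output.
func metadataStatusError(statusCode int, status string, body []byte) error {
	// strings.Fields splits on any whitespace, including newlines, so
	// joining the fields with a space yields a one-line reply.
	reply := strings.Join(strings.Fields(string(body)), " ")
	return fmt.Errorf("GET error. HTTP response code: %d message: %s reply: %s",
		statusCode, status, reply)
}

func main() {
	body := []byte("{\n  \"error\": \"not found\"\n}")
	fmt.Println(metadataStatusError(404, "404 Not Found", body))
}
```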

// t.Logf("metadata is \n%v", metadata)
// }

//fmt.Printf("metadata is \n%v", metadata)
Contributor

Don't forget to clear out this code.

metadataService *AzureInstanceMetadata
instanceMetadata *VirtualMachineMetadata
msiToken *msiToken
msiResource string
Contributor

Make this a package const

instanceMetadata *VirtualMachineMetadata
msiToken *msiToken
msiResource string
bearerToken string
Contributor

Use directly from msiToken

msiToken *msiToken
msiResource string
bearerToken string
expiryWatermark time.Duration
Contributor

I see this being used but where is it set?

expiryWatermark time.Duration

oauthConfig *adal.OAuthConfig
adalToken adal.OAuthTokenProvider
Contributor

This I see being set but not used.

period time.Duration
delay time.Duration
periodStart time.Time
periodEnd time.Time
Contributor

I think these two times can just be locals

return nil, fmt.Errorf("Error authenticating: %v", err)
}

metricsEndpoint := fmt.Sprintf("https://%s.monitoring.azure.com%s/metrics",

One thing to be careful about here: custom metrics are in preview on the Azure Monitor side, so we don't support all regions of public Azure. We will only be available in a few regions as part of the preview, so not all of these endpoints will exist.

return nil, fmt.Errorf("Error authenticating: %v", err)
}

metricsEndpoint := fmt.Sprintf("https://%s.monitoring.azure.com%s/metrics",

For any of the endpoints that the output plugin communicates with (e.g. https://.monitoring.azure.com), we should ideally move them into a single place in the code rather than sprinkle them across all files. This will help future-proof the plugin so it can more easily support other Azure clouds down the road (Azure Germany, Azure China, Azure US Government), which will all have their own endpoints.
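
A minimal sketch of what keeping the endpoint in one place could look like. The URL layout follows the `fmt.Sprintf` call in the diff; the constant `publicCloudHostSuffix` and the helper `metricsURL` are hypothetical names, and no values for other clouds are included because the thread does not give them.

```go
package main

import "fmt"

// publicCloudHostSuffix is the host suffix used for public Azure in this PR.
// Other clouds would supply their own suffix through the same parameter.
const publicCloudHostSuffix = "monitoring.azure.com"

// metricsURL is a hypothetical helper that builds the custom metrics
// endpoint from the host suffix, the region, and the Azure resource ID.
// The resource ID is expected to start with a slash.
func metricsURL(hostSuffix, region, resourceID string) string {
	return fmt.Sprintf("https://%s.%s%s/metrics", region, hostSuffix, resourceID)
}

func main() {
	// The resource ID below is a placeholder, not a real resource.
	id := "/subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm"
	fmt.Println(metricsURL(publicCloudHostSuffix, "eastus", id))
}
```

With this shape, supporting another cloud would mean passing a different suffix rather than editing format strings in several files.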

@gunnaraasen
Member Author

@danielnelson I've updated this branch to use the running output aggregator pattern you suggested. I moved the original PR code to the ga-azure-monitor-original branch. I will add some tests if you think the architecture makes sense now.

@danielnelson
Contributor

Looks good. I like how you dealt with not having the filter for old metrics. It gives me some ideas for improving this with the normal aggregators.

@asheniam asheniam left a comment

Adding feedback on the Azure Monitor output plugin

#resource_id = "/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Compute/virtualMachines/<vm_name>"
## Azure region to publish metrics against. Defaults to eastus.
## Leave blank to automatically query the region via MSI.
#region = "useast"

Use "eastus" as the default, not "useast". "useast" will not work -- it won't resolve to any monitoring endpoint.

}

const (
defaultRegion string = "eastus"

We shouldn't have a default region constant. This value either needs to come from instance metadata or come from user configuration. The region must match the region of the Azure resource ID and can't be guessed.

return err
}

req.Header.Set("Authorization", "Bearer "+a.msiToken.AccessToken)

If we are not using MSI, where are we setting a.adalToken?

}
defer resp.Body.Close()

if resp.StatusCode >= 300 || resp.StatusCode < 200 {

Why are we only doing fmt.Errorf for status codes in the [300, 200) range? We should also follow this pattern for any errors encountered in the 4xx or 5xx range.
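
For reference when reading the condition in the diff: `code >= 300 || code < 200` is true for every status code outside 200 to 299, which includes the 4xx and 5xx ranges. The sketch below compares it with an explicit 2xx check; `isSuccess` is a hypothetical helper name.

```go
package main

import "fmt"

// isSuccess reports whether an HTTP status code is in the 2xx range.
func isSuccess(code int) bool {
	return code >= 200 && code <= 299
}

func main() {
	for _, code := range []int{199, 200, 204, 301, 404, 500} {
		// Second column: the condition from the diff.
		// Third column: the negated 2xx check. The two always agree.
		fmt.Println(code, code >= 300 || code < 200, !isSuccess(code))
	}
}
```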

Data: &azureMonitorData{
BaseData: &azureMonitorBaseData{
Metric: m.Name(),
Namespace: "default",

Any reason we chose to set the metric namespace to "default"? It might be good for users to override this via config, but as a default it might be better to go with a value that has "Telegraf" in the namespace.

for _, m := range azmetrics {
// Azure Monitor accepts new batches of points in new-line delimited
// JSON, following RFC 4288 (see https://github.com/ndjson/ndjson-spec).
jsonBytes, err := json.Marshal(&m)

How large can azmetrics be? I believe the Azure Monitor metric API has a max request body size of 4MB. If we exceed this limit, we should issue multiple POST requests.
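
A minimal sketch of the splitting described above, assuming each metric has already been marshalled to one JSON record. The helper `splitBatches` is hypothetical, and the size limit is a parameter because the 4MB figure is the reviewer's recollection rather than a confirmed value.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitBatches is a hypothetical helper. It packs newline-delimited JSON
// records into request bodies of at most maxBytes each, so that every body
// can be sent in its own POST. A single record larger than maxBytes is
// returned as its own oversized batch rather than being dropped.
func splitBatches(records [][]byte, maxBytes int) [][]byte {
	var batches [][]byte
	var cur bytes.Buffer
	for _, rec := range records {
		// +1 accounts for the newline written after each record.
		if cur.Len() > 0 && cur.Len()+len(rec)+1 > maxBytes {
			// Copy the bytes out, since Reset reuses the buffer's storage.
			batches = append(batches, append([]byte(nil), cur.Bytes()...))
			cur.Reset()
		}
		cur.Write(rec)
		cur.WriteByte('\n')
	}
	if cur.Len() > 0 {
		batches = append(batches, append([]byte(nil), cur.Bytes()...))
	}
	return batches
}

func main() {
	records := [][]byte{[]byte(`{"a":1}`), []byte(`{"b":2}`), []byte(`{"c":3}`)}
	// The 16-byte limit only makes the split visible in a small example.
	for i, b := range splitBatches(records, 16) {
		fmt.Printf("batch %d: %q\n", i, b)
	}
}
```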

# timeout = "5s"

## Whether or not to use managed service identity.
#use_managed_service_identity = true

Nit: In the sample configuration, we should make it clear that when use_managed_service_identity is false, the user must supply resource_id, region, azure_subscription, azure_tenant, azure_client_id, and azure_client_secret. These become mandatory parameters.
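
The same rule could also be checked at startup. The sketch below is an assumption about how such a check might look, not code from the PR: the `config` type and `missingOptions` function are hypothetical, and only the option names are taken from the comment above.

```go
package main

import "fmt"

// config is a hypothetical subset of the plugin settings named in the review.
type config struct {
	UseMSI            bool
	ResourceID        string
	Region            string
	AzureSubscription string
	AzureTenant       string
	AzureClientID     string
	AzureClientSecret string
}

// missingOptions returns the names of options that are required but empty
// when managed service identity is turned off.
func missingOptions(c config) []string {
	if c.UseMSI {
		return nil
	}
	required := []struct{ name, value string }{
		{"resource_id", c.ResourceID},
		{"region", c.Region},
		{"azure_subscription", c.AzureSubscription},
		{"azure_tenant", c.AzureTenant},
		{"azure_client_id", c.AzureClientID},
		{"azure_client_secret", c.AzureClientSecret},
	}
	var missing []string
	for _, r := range required {
		if r.value == "" {
			missing = append(missing, r.name)
		}
	}
	return missing
}

func main() {
	fmt.Println(missingOptions(config{UseMSI: false, Region: "eastus"}))
}
```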

useMsi bool `toml:"use_managed_service_identity"`
ResourceID string `toml:"resource_id"`
Region string `toml:"region"`
Timeout internal.Duration `toml:"Timeout"`

Nit: Should this be lower case "timeout" instead of "Timeout"?

AzureTenantID string `toml:"azure_tenant"`
AzureClientID string `toml:"azure_client_id"`
AzureClientSecret string `toml:"azure_client_secret"`
StringAsDimension bool `toml:"string_as_dimension"`

What is StringAsDimension? There is no example of it in the sample config.

return &AzureMonitor{
StringAsDimension: false,
Timeout: internal.Duration{Duration: time.Second * 5},
Region: defaultRegion,

As mentioned earlier, we shouldn't treat region special with a default region. This should be treated as the rest of the configuration -- either it comes from instance metadata or from user supplied config.

@asheniam

@danielnelson - I saw the milestone changed from 1.7 to 1.8. What is the timeline for 1.8?

@danielnelson
Contributor

I believe 1.8 will be finished around the end of August. Does that sound right, @russorat?

continue
}
for id := range a.cache[tbucket] {
a.cache[tbucket][id].updated = false
Contributor

If the interval is lower than the 1m aggregation period, this can get set to false before the metric has had a chance to be returned in Push(), since you may have multiple calls to Reset() before the metric is returned. This causes the plugin not to write any metrics unless you set the flush_interval to at least 1m.


// Pull region and resource identifier
err := a.GetInstanceMetadata()
if err != nil && a.ResourceID == "" && a.Region == "" {
Contributor

I think this should be || instead of &&. We should always return if err != nil so that we are sure to see all errors, and I think we need both of these set.

Also, if Region and ResourceID are set beforehand, maybe we can skip this function completely?

}

// GetInstanceMetadata retrieves metadata about the current Azure VM
func (a *AzureMonitor) GetInstanceMetadata() error {
Contributor

I think ideally this function would return region, resource, error. The calling function would combine these with the plugin settings and create the url. Also, consider passing in the client and making this and the function above a free function.
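
A minimal sketch of the suggested shape, under stated assumptions: the metadata query is a free function returning region, resource ID, and an error, and the caller combines the result with the plugin settings. `lookupFunc`, `resolveTarget`, and `unavailable` are hypothetical names, and the lookup is skipped when both values are already configured.

```go
package main

import (
	"errors"
	"fmt"
)

// lookupFunc stands in for the instance metadata query. It is hypothetical
// and returns the region and resource ID of the current VM.
type lookupFunc func() (region, resourceID string, err error)

// resolveTarget combines configured values with discovered ones.
// Configured values win, and the lookup is skipped when both are set.
func resolveTarget(cfgRegion, cfgResourceID string, lookup lookupFunc) (string, string, error) {
	if cfgRegion != "" && cfgResourceID != "" {
		return cfgRegion, cfgResourceID, nil
	}
	region, resourceID, err := lookup()
	if err != nil {
		return "", "", fmt.Errorf("instance metadata lookup failed: %v", err)
	}
	if cfgRegion != "" {
		region = cfgRegion
	}
	if cfgResourceID != "" {
		resourceID = cfgResourceID
	}
	if region == "" || resourceID == "" {
		return "", "", errors.New("region and resource_id must be configured or discoverable")
	}
	return region, resourceID, nil
}

// unavailable simulates running outside an Azure VM.
func unavailable() (string, string, error) {
	return "", "", errors.New("metadata service not available")
}

func main() {
	fmt.Println(resolveTarget("eastus", "/subscriptions/sub", unavailable))
	fmt.Println(resolveTarget("", "", unavailable))
}
```

Passing the lookup in as a function keeps the resolution logic testable without a real metadata endpoint.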

if err != nil && a.ResourceID == "" && a.Region == "" {
return fmt.Errorf("E! No resource id specified, and Azure Instance metadata service not available. If not running on an Azure VM, provide a value for resource_id")
}

Contributor

Once we have the final discovered URL, I think we should add a log message with the final version at debug level.

@gunnaraasen gunnaraasen force-pushed the ga-azure-monitor branch 4 times, most recently from 3639d05 to 50854bf (July 11, 2018 15:55)
Identity](https://docs.microsoft.com/en-us/azure/active-directory/msi-overview)
for more details. Only available on ARM-based resources.

**Note: As shown above, the last option (#5) is the preferred way to
Contributor

I guess this should be number 4

**Note: As shown above, the last option (#5) is the preferred way to
authenticate when running Telegraf on Azure VMs. The VMs will need to be given
access to the Azure Monitor to publish custom metrics. Instructions on how to
grant access can be found [here]()**
Contributor

Don't forget to add the link target. I suggest just turning the whole sentence into a link:

[Instructions on how to grant access](http://example.org).

If Telegraf is not running on a virtual machine or the VM Instance Metadata service is not available, the following variables are required for the output to function.

* region
* resourceId
Contributor

resource_id


This plugin will send custom metrics to Azure Monitor.
Azure Monitor has a metric resolution of one minute.
To handle this in Telegraf, the Azure Monitor output plugin will automatically aggregate metrics into one minute buckets, which are then sent to Azure Monitor on every flush interval.
Contributor

I know you are trying to do the "semantic linefeeds" style, but can you make sure to wrap all the lines at no more than 78 chars. As an aside, I don't really care for this style, I find it hard to read the plain text and the diff advantage is small as changes can be handled using --word-diff.


### Configuration:

Contributor

For this, can you run telegraf --usage azure_monitor and use the output (minus the Description text)?

return err
}

// req, err := http.NewRequest("POST", a.url, bytes.NewBuffer(body))
Contributor

Remove commented out code

// refresh the token if needed.
req, err = autorest.CreatePreparer(a.auth.WithAuthorization()).Prepare(req)
if err != nil {
return fmt.Errorf("E! [outputs.azure_monitor] Unable to fetch authentication credentials: %v", err)
Contributor

This is an error, not a log message, so don't add the log level or module: unable to fetch authentication....

}

if resp.StatusCode < 200 || resp.StatusCode > 299 {
return fmt.Errorf("E! Failed to write: %v", string(rbody))
Contributor

Remove log level, start with lowercase. I would not include the body as it could be very long.
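
Taken together with the earlier comment about log prefixes, the returned errors might look roughly like the sketch below: no log level, no module name, a lowercase first word, and no response body. The helper names and the exact wording of the status message are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// wrapAuthError is a hypothetical helper. It returns a plain error value;
// the caller that logs it is the one that adds the level and plugin name.
func wrapAuthError(err error) error {
	return fmt.Errorf("unable to fetch authentication credentials: %v", err)
}

// writeStatusError is a hypothetical helper. It reports only the status
// code and leaves out the response body, which could be very long.
func writeStatusError(statusCode int) error {
	return fmt.Errorf("failed to write: received status code %d", statusCode)
}

func main() {
	fmt.Println(wrapAuthError(errors.New("token expired")))
	fmt.Println(writeStatusError(503))
}
```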

continue
}
for id := range a.cache[tbucket] {
a.cache[tbucket][id].updated = false
Contributor

I think you should reset this in Push. That will be more resilient against reordered calls, even though they shouldn't happen in the current implementation: Push -> Add -> Reset. Also, Push feels like the more appropriate place to clear the update flag.
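
To make the idea concrete, here is a small self-contained sketch of a bucketed cache in which Push clears the flag only for aggregates it actually returns. It is not the plugin's code: the types are simplified, timestamps are plain Unix seconds, and all names other than Add and Push are hypothetical.

```go
package main

import "fmt"

// periodSeconds is the one-minute aggregation period.
const periodSeconds = 60

type aggregate struct {
	sum     float64
	count   int
	updated bool
}

// cache maps a bucket start time (Unix seconds) to per-metric aggregates.
type cache map[int64]map[string]*aggregate

// Add folds a value into the bucket that contains ts and marks it updated.
func (c cache) Add(ts int64, name string, value float64) {
	bucket := ts - ts%periodSeconds
	if c[bucket] == nil {
		c[bucket] = make(map[string]*aggregate)
	}
	agg := c[bucket][name]
	if agg == nil {
		agg = &aggregate{}
		c[bucket][name] = agg
	}
	agg.sum += value
	agg.count++
	agg.updated = true
}

// Push returns updated aggregates from buckets that have closed by now.
// The updated flag is cleared here, and only for what is returned, so
// flushes that happen while a bucket is still open cannot lose data.
func (c cache) Push(now int64) []aggregate {
	var out []aggregate
	for bucket, aggs := range c {
		if bucket+periodSeconds > now {
			continue // bucket is still open
		}
		for _, agg := range aggs {
			if agg.updated {
				out = append(out, *agg)
				agg.updated = false
			}
		}
	}
	return out
}

func main() {
	c := cache{}
	c.Add(1000, "cpu", 1.5) // falls in the bucket covering 960 to 1019
	for _, now := range []int64{1010, 1015, 1021, 1030} {
		fmt.Println(now, len(c.Push(now)))
	}
}
```

In this sketch, flushing more often than once per minute is harmless: early calls return nothing, and the aggregate is emitted once after its bucket closes.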

@@ -0,0 +1,275 @@
package azure_monitor
Contributor

Let us know when the tests are ready.

@gunnaraasen gunnaraasen force-pushed the ga-azure-monitor branch 2 times, most recently from 08d6e68 to 5de442b (August 8, 2018 18:07)
Contributor

@glinton glinton left a comment

Appears all that's left is to add your tests and update your branch

fields fields
wantErr bool
}{
// TODO: Add test cases.
Contributor

Add tests

@danielnelson danielnelson merged commit f70d651 into master Sep 5, 2018
@danielnelson danielnelson deleted the ga-azure-monitor branch September 5, 2018 21:50
rgitzel pushed a commit to rgitzel/telegraf that referenced this pull request Oct 17, 2018
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
dupondje pushed a commit to dupondje/telegraf that referenced this pull request Apr 22, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
4 participants