Convert Mixer self-monitoring to use OpenCensus libraries #7989

Merged
merged 15 commits into from Aug 23, 2018

Conversation

douglas-reid
Contributor

@douglas-reid douglas-reid commented Aug 16, 2018

This PR is part of an exploration of moving Istio self-monitoring from pure Prometheus libraries to OpenCensus (OC) libraries. The ultimate goal being explored is an Istio OpenCensus exporter mechanism that would allow flexible export of Istio self-monitoring metrics to different backends (CloudWatch, Datadog, Stackdriver, etc.).

This PR does the following:

  • adds an OC gRPC stats handler to replace the Prometheus gRPC interceptor
  • tries to track down and remove all self-monitoring bits using Prometheus libraries (checkcache, runtime)
  • configures a Prometheus exporter for now (on same self-monitoring port)

This PR does not replace the Prometheus collectors that provide additional metrics (not from Mixer). These include:

  • Golang collector (for gauges on go-routines, etc.)
  • Process collector (linux process monitoring bits)

Reviewers:
I am interested in feedback on a few items. First, does this library change make sense? Second, does the way that OC is being used seem appropriate?

Important items to note when reviewing:
OC models things differently from Prometheus, and this results in some interesting changes:

  • For gauge aggregations, there isn't any way currently to express "last value +/- delta". So, in places where the codebase was using that sort of logic, new logic was added to separately track the value.
  • In establishing views, I've chosen to mostly use view.Count, primarily because that results in exporting Prometheus metrics of type Counter, whereas view.Sum results in untyped Prometheus metrics. This ultimately impacts how we use the stats, as view.Count records the number of measurements, not the total of those measurements. I think this is the right thing to do for now, but maybe not. The Prometheus docs include the following related documentation:

The Prometheus server does not yet make use of the type information and flattens all data into untyped time series. This may change in the future.

  • Because OC does not export stats for which no measurements have been recorded, the overall output to Prometheus will be different than before (we won't see 0s for things that haven't happened, etc.).
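To make the Count versus Sum distinction and the "no zeros" behavior above concrete, here is a minimal sketch. It is a toy model written against the standard library only, not the go.opencensus.io API; the `aggregation` type, its methods, and the metric names are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// aggregation is a toy stand-in for a view's aggregated data.
// It is NOT the OpenCensus API; it only mimics the semantics
// described in the PR text.
type aggregation struct {
	count    int64 // what a Count-style view would report
	sum      int64 // what a Sum-style view would report
	recorded bool  // false until the first measurement arrives
}

// record folds one measurement into both aggregates.
func (a *aggregation) record(v int64) {
	a.count++
	a.sum += v
	a.recorded = true
}

// export returns the lines a scrape would see. Nothing is emitted
// for a measure that has never been recorded.
func (a *aggregation) export(name string) []string {
	if !a.recorded {
		return nil
	}
	return []string{
		fmt.Sprintf("%s_count %d", name, a.count),
		fmt.Sprintf("%s_sum %d", name, a.sum),
	}
}

func main() {
	var destinations, unused aggregation
	for _, v := range []int64{3, 0, 5} {
		destinations.record(v)
	}
	for _, line := range destinations.export("destinations_per_request") {
		fmt.Println(line)
	}
	// A measure with no recordings produces no output lines at all.
	fmt.Println(len(unused.export("never_recorded")))
}
```

With three recordings of 3, 0, and 5, the count-style aggregate is 3 while the sum-style aggregate is 8, which is why the choice of aggregation changes what a dashboard query means.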

@codecov

codecov bot commented Aug 16, 2018

Codecov Report

Merging #7989 into master will increase coverage by 1%.
The diff coverage is 91%.

@@          Coverage Diff           @@
##           master   #7989   +/-   ##
======================================
+ Coverage      71%     71%   +1%     
======================================
  Files         374     370    -4     
  Lines       32673   32620   -53     
======================================
+ Hits        22906   22937   +31     
+ Misses       8760    8692   -68     
+ Partials     1007     991   -16
Impacted Files Coverage Δ
mixer/pkg/runtime/routing/table.go 100% <ø> (ø) ⬆️
mixer/pkg/server/server.go 97% <100%> (+7%) ⬆️
mixer/pkg/runtime/dispatcher/dispatchstate.go 94% <100%> (+1%) ⬆️
mixer/pkg/checkcache/cache.go 100% <100%> (ø) ⬆️
mixer/pkg/runtime/dispatcher/session.go 93% <50%> (-1%) ⬇️
mixer/pkg/runtime/config/snapshot.go 22% <72%> (+4%) ⬆️
mixer/pkg/runtime/routing/builder.go 68% <72%> (ø) ⬇️
mixer/pkg/server/monitoring.go 96% <75%> (+1%) ⬆️
mixer/pkg/runtime/handler/table.go 77% <82%> (-1%) ⬇️
mixer/pkg/runtime/handler/env.go 96% <90%> (+11%) ⬆️
... and 52 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a7ce28...c76dd92. Read the comment docs.

@douglas-reid douglas-reid added the do-not-merge/hold Block automatic merging of a PR. label Aug 16, 2018
@douglas-reid
Contributor Author

cc: @Ramonza

Contributor

@geeknoid geeknoid left a comment

This looks pretty good.

Is there a solution to deal with the metrics used by our dependencies? This is identical to the problem we had when replacing glog, which led to the istio/glog repo. Do we need something similar here?

	return nil
}

// Get looks up an attribute bag in the cache.
func (cc *Cache) Get(attrs attribute.Bag) (Value, bool) {
	defer func() { cc.recordStats() }()
Contributor

Here and elsewhere in this file, I'm not thrilled with "defer" here since it induces an allocation (last I looked)

Contributor Author

Removed.
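One way such a `defer` can be removed is to call the recording function explicitly on each return path. The sketch below only illustrates that pattern and is not the change actually made in this PR; the `cache` type, its fields, and `recordStats` are hypothetical stand-ins built on the standard library.

```go
package main

import "fmt"

// cache is a hypothetical stand-in for a check cache.
type cache struct {
	entries map[string]string
	hits    int64
	misses  int64
	records int // number of times recordStats ran
}

// recordStats stands in for publishing the counters to a stats library.
func (c *cache) recordStats() { c.records++ }

// Get calls recordStats explicitly on each return path instead of
// relying on `defer func() { c.recordStats() }()`.
func (c *cache) Get(key string) (string, bool) {
	v, ok := c.entries[key]
	if !ok {
		c.misses++
		c.recordStats()
		return "", false
	}
	c.hits++
	c.recordStats()
	return v, true
}

func main() {
	c := &cache{entries: map[string]string{"a": "1"}}
	c.Get("a") // hit
	c.Get("b") // miss
	fmt.Println(c.hits, c.misses, c.records)
}
```

The trade-off is that every new return path must remember to record, which `defer` would otherwise guarantee.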


exporter, err := prometheus.NewExporter(prometheus.Options{Registry: oprometheus.DefaultRegisterer.(*oprometheus.Registry)})
if err != nil {
	return nil, fmt.Errorf("could not build prometheus exporter: %v", err)
Contributor

Need to add a unit test to cover this new code.


// Register the views to collect server request count.
if err := view.Register(ocgrpc.DefaultServerViews...); err != nil {
	return nil, fmt.Errorf("could not register default server views: %v", err)
Contributor

Need unit test for this.

missesTotal = stats.Int64("mixer/checkcache/cache_misses_total", "The number of times a cache lookup operation failed to find an entry in the cache.", stats.UnitDimensionless)
evictionsTotal = stats.Int64("mixer/checkcache/cache_evictions_total", "The number of entries that have been evicted from the cache.", stats.UnitDimensionless)

writesView = newView(writesTotal, []tag.Key{}, view.Sum())
Contributor

I think you want the LastValue aggregation type here. Otherwise, with Sum, you will keep adding to a cumulative total. It looks like the values you're recording already represent a cumulative total, so I think you really just want to record the latest value rather than total up all the values you ever passed to stats.Record.

Contributor Author

Great catch!
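The over-counting described in this thread can be seen in a small toy model. This sketch uses only the standard library; `sumAgg` and `lastValueAgg` are hypothetical types that mimic the Sum and LastValue semantics and are not the OpenCensus implementations.

```go
package main

import "fmt"

// sumAgg mimics a Sum aggregation: every recorded value is added.
type sumAgg struct{ total int64 }

func (s *sumAgg) record(v int64) { s.total += v }

// lastValueAgg mimics a LastValue aggregation: only the most recent
// recorded value is kept.
type lastValueAgg struct{ value int64 }

func (l *lastValueAgg) record(v int64) { l.value = v }

func main() {
	var sum sumAgg
	var last lastValueAgg

	// The cache already reports a running total of writes, so each
	// recorded value is a cumulative snapshot: 1, then 2, then 3.
	var writes int64
	for i := 0; i < 3; i++ {
		writes++
		sum.record(writes)
		last.record(writes)
	}

	fmt.Println("sum:", sum.total)         // 1+2+3, overstates the 3 writes
	fmt.Println("last value:", last.value) // matches the cache's own total
}
```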


func newView(measure stats.Measure, keys []tag.Key, aggregation *view.Aggregation) *view.View {
	return &view.View{
		Name: measure.Name(),
Contributor

FWIW, this is the default if no name is specified

@@ -200,3 +183,10 @@ func (cc *Cache) Set(attrs attribute.Bag, value Value) {

	cc.cache.SetWithExpiration(shape.makeKey(attrs), value, value.Expiration.Sub(now))
}

func (cc *Cache) recordStats() {
	stats.Record(context.Background(), writesTotal.M(int64(cc.cache.Stats().Writes)))
Contributor

This can be coalesced into a single Record call:

cs := cc.cache.Stats()
stats.Record(
    context.Background(),
    hitsTotal.M(int64(cs.Hits)),
    missesTotal.M(int64(cs.Misses)),
    ...

@@ -105,7 +107,9 @@ func (s *session) dispatch() error {
	namespace, err := getIdentityNamespace(s.bag)
	if err != nil {
		// early return.
		updateRequestCounters(0, 0)
		stats.Record(s.ctx, monitoring.DestinationsPerRequest.M(0))
Contributor

Can be coalesced to a single Record call

@@ -186,7 +190,9 @@ func (s *session) dispatch() error {
		}
	}

	updateRequestCounters(ndestinations, ninputs)
	stats.Record(s.ctx, monitoring.DestinationsPerRequest.M(int64(ndestinations)))
Contributor

ditto

@douglas-reid douglas-reid changed the title WIP: Convert Mixer self-monitoring to use OpenCensus libraries Convert Mixer self-monitoring to use OpenCensus libraries Aug 22, 2018
@douglas-reid
Contributor Author

Reviewers: PTAL.

I believe I have addressed the existing comments (adding unit tests, etc.). I have also updated the e2e_dashboard test (passes on local clusters) and the Mixer dashboard that is exercised by that test.

@douglas-reid douglas-reid removed the do-not-merge/hold Block automatic merging of a PR. label Aug 22, 2018
@douglas-reid
Contributor Author

@geeknoid was there something you had in mind re: metrics used by our deps? I was not aware of any dependencies that were using prometheus to monitor anything prior to this PR.

I don't know that we need to solve that particular issue right now, but it does raise an interesting question of package instrumentation and playing well with others. @Ramonza does OC have any thoughts on such issues?

@douglas-reid
Contributor Author

/test istio-unit-tests

and, of course, linting produces different results on every run... sigh.

@geeknoid
Contributor

Sorry for my unclear comment. I was referring to this from your PR comment:

This PR does not replace the Prometheus collectors that provide additional metrics (not from Mixer). These include:

Golang collector (for gauges on go-routines, etc.)
Process collector (linux process monitoring bits)

Can these be addressed too at some point in a systematic way?

On a related note, the ControlZ library is currently mining Prometheus metrics for display in the UI. What happens in ControlZ after this PR goes in?

@douglas-reid
Contributor Author

@geeknoid ah, gotcha. Yes, probably. But rather than duplicate that functionality in this PR, I've skipped over providing those in an OpenCensus way. (The key, I believe, is finding some hook to generate them on the fly the same way Prometheus does.) I think, ultimately, it probably makes more sense for OpenCensus to provide those measures than for Istio to build them. For now, skipping them seems fine. @Ramonza thoughts here?

RE ControlZ:

It still works. We use the default registry (gatherer) for the prometheus exporter here, and that is what is currently powering the metrics portion of ControlZ.
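As a rough illustration of the "generate them on the fly" hook mentioned in this comment, runtime gauges can be modeled as functions that are evaluated at collection time rather than recorded ahead of time. This is a sketch under stated assumptions: the `gaugeFunc` type and `collect` function are hypothetical, and it does not reflect how the Prometheus Go collector or OpenCensus is actually implemented.

```go
package main

import (
	"fmt"
	"runtime"
)

// gaugeFunc is a hypothetical hook: a gauge whose value is computed
// by calling read at collection time instead of being recorded.
type gaugeFunc struct {
	name string
	read func() int64
}

// collect evaluates every hook and returns name/value pairs.
func collect(gauges []gaugeFunc) map[string]int64 {
	out := make(map[string]int64, len(gauges))
	for _, g := range gauges {
		out[g.name] = g.read()
	}
	return out
}

func main() {
	gauges := []gaugeFunc{
		// The metric name here is illustrative only.
		{name: "go_goroutines", read: func() int64 { return int64(runtime.NumGoroutine()) }},
	}
	for name, v := range collect(gauges) {
		fmt.Println(name, v)
	}
}
```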

@geeknoid
Contributor

geeknoid commented Aug 22, 2018 via email

@douglas-reid
Contributor Author

Thanks for all the reviews! I've fixed up all of the linter issues and dep diff stuff now. PTAL for final approval.

@geeknoid
Contributor

/lgtm
/approve

@istio-testing
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: geeknoid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@istio-testing
Collaborator

@douglas-reid: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
prow/istio-pilot-multicluster-e2e.sh c76dd92 link /test istio-pilot-multicluster-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@istio-testing istio-testing merged commit 6faa6cb into istio:master Aug 23, 2018
@douglas-reid douglas-reid deleted the mixer-opencensus branch August 27, 2019 17:42