Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc: add streaming metrics #3048

Merged
merged 9 commits into from
Oct 15, 2020
Merged

Conversation

kyessenov
Copy link
Contributor

@kyessenov kyessenov commented Oct 12, 2020

Signed-off-by: Kuat Yessenov kuat@google.com

  • Adds metrics for request and response message counts.
    These are enabled by default, but I can add a guard.

  • Do not report 0 counter values. This is useful for gRPC counts since the metric applies to all HTTP traffic. Peers may interchange HTTP and gRPC traffic on the same connection, so we cannot predict whether gRPC metric is not needed.

  • Interval reporting will be in a follow-up.

Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov kyessenov requested a review from a team October 12, 2020 22:23
@kyessenov kyessenov requested a review from a team as a code owner October 12, 2020 22:23
@istio-testing istio-testing added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 12, 2020
@google-cla google-cla bot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Oct 12, 2020
Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Kuat Yessenov <kuat@google.com>
value: default
- name: request_protocol
value: grpc
- name: response_code
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need response_code on these metrics? it is always 200, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is redundant. It's an artifact of the implementation, that should be fixed I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

- name: grpc_response_status
value: "{{ .Vars.GrpcResponseStatus }}"
- name: response_flags
value: DC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we expect DC in the base case for client metrics? does response_flags add value here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DC is downstream closure (client closed connection before server responded, intentionally). I think it's valuable here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

value: server
- name: destination_service_namespace
value: default
- name: request_protocol
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder, is this useful? will these metrics already imply grpc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The same problem is with TCP (it's also tcp there). It's not useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

[](const ::Wasm::Common::RequestInfo& request_info)
-> uint64_t { return request_info.response_size; },
false},
// GRPC metrics.
MetricFactory{
"request_messages", MetricType::Counter,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we consider making these request_messages_total and response_messages_total to align with prom best practices: https://prometheus.io/docs/practices/naming/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we went with _count instead of _total ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, renamed to total.

Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov
Copy link
Contributor Author

Removed a bunch of labels and can remove more. They can be back-filled with metric customization.

@kyessenov
Copy link
Contributor Author

/test release-centos-test_proxy

@kyessenov
Copy link
Contributor Author

/test test-tsan_proxy

@istio-testing
Copy link
Collaborator

istio-testing commented Oct 14, 2020

@kyessenov: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
release-centos-test_proxy 1bfc85b link /test release-centos-test_proxy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kyessenov
Copy link
Contributor Author

/test test-tsan_proxy

@kyessenov
Copy link
Contributor Author

@douglas-reid @mandarjog is this good to go for non-interval reporting? I'm afraid interval reporting is going to run into some Wasm lifecycle issue, so want to separate it out for now.

@@ -391,6 +395,19 @@ void populateTCPRequestInfo(bool outbound, RequestInfo* request_info,
request_info->request_protocol = kProtocolTCP;
}

bool populateGRPCInfo(RequestInfo* request_info) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to return boolean here? does not seem to be used and it always returns false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns true if it parses correctly (SimpleAtoi return bool). It's not used yet, but seems reasonable to have a status return.

[](const ::Wasm::Common::RequestInfo& request_info) -> uint64_t {
return request_info.request_message_count;
},
false, count_peer_labels},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some comments here about what labels are ignored for this metric?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also would connection_security_policy worth to be added?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we are considering that it is already tracked by request total, thus it is a duplicated here and this should just focus on the message count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I only put peer labels, since the other metrics apply to the outer request rather than individual messages within them. I'm not sure if we want more, so I am starting from the minimal set.

"WasmRuntime": "envoy.wasm.runtime.null",
"DisableDirectResponse": "true",
"UsingGrpcBackend": "true",
"GrpcResponseStatus": "2",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Copy link
Contributor

@bianpengyuan bianpengyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lg, only minor comments.

Signed-off-by: Kuat Yessenov <kuat@google.com>
MetricFactory{
"response_messages_count", MetricType::Counter,
[](const ::Wasm::Common::RequestInfo& request_info) -> uint64_t {
return request_info.response_message_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about metric for avg size of these messages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be difficult I think. The messages are framed by another filter, so we don't have a way to accumulate their sizes here. Open to any suggestions how to make that possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful.
@kyessenov is it possible to add it upstream? We need a distribution of message sizes.
We need not hold this PR for it though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can report size distribution to native envoy stats, but I don't know how to get it out back here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure either... but can we use grpc stats filter only to get message size and add it in the above stats?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be many messages, so we'd be passing a large list of values for sizes. We would need histogram summaries I think, to make it work.

Signed-off-by: Kuat Yessenov <kuat@google.com>
Copy link
Contributor

@mandarjog mandarjog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
6 participants