A80: gRPC Metrics for TCP connection #428
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
A80: gRPC Metrics for TCP connection | ||
---- | ||
* Author(s): Yash Tibrewal (@yashykt), Nana Pang (@nanahpang), Yousuk Seung (@yousukseung) | ||
* Approver: Craig Tiller (@ctiller), Mark Roth (@markdroth) | ||
* Status: {Draft, In Review, Ready for Implementation, Implemented} | ||
* language: {...} | ||
* Last updated: 2024-04-18 | ||
* Discussion at: https://groups.google.com/g/grpc-io/c/AyT0LVgoqFs | ||
|
||
## Abstract | ||
|
||
This document proposes adding new TCP connection metrics to gRPC for improved network analysis and debugging. | ||
|
||
## Background | ||
|
||
To improve the network debugging capabilities for gRPC users, we propose adding per-connection TCP metrics in gRPC. The metrics will utilize the metrics framework outlined in [A79]. | ||
|
||
### Related Proposals: | ||
* [A79]: gRPC Non-Per-Call Metrics Framework | ||
|
||
[A79]: A79-non-per-call-metrics-architecture.md | ||
|
||
## Proposal | ||
|
||
This document proposes changes to the following gRPC components. | ||
|
||
### Per-Connection TCP Metrics | ||
|
||
We will provide the following metrics: | ||
- `grpc.tcp.min_rtt` | ||
- `grpc.tcp.delivery_rate` | ||
- `grpc.tcp.packets_sent` | ||
- `grpc.tcp.packets_retransmitted` | ||
- `grpc.tcp.packets_spurious_retransmitted` | ||
|
||
The metrics will be exported as: | ||
|
||
| Name | Type | Unit | Labels | Description | | ||
| ------------- | ----- | ----- | ------- | ----------- | | ||
| grpc.tcp.min_rtt | Histogram (floating-point) | s | None | Records TCP's current estimate of minimum round trip time (RTT), typically used as an indication of the network health between two endpoints.<br /> RTT = packet acked timestamp - packet sent timestamp. | | ||
| grpc.tcp.delivery_rate | Histogram (floating-point) | bit/s | None | Records the latest goodput measured for the TCP connection. <br /> Elapsed time = packet acked timestamp - last packet acked timestamp. <br /> Delivery rate = packet acked bytes / elapsed time. | ||
| grpc.tcp.packets_sent | Counter (integer) | {packet} | None | Records total packets TCP sends in the calculation period. | | ||
| grpc.tcp.packets_retransmitted | Counter (integer) | {packet} | None | Records total packets retransmitted in the calculation period, including retransmissions of lost packets and spurious retransmissions. | ||
| grpc.tcp.packets_spurious_retransmitted | Counter (integer) | {packet} | None | Records total spuriously retransmitted packets in the calculation period. These are retransmissions that TCP later discovered were unnecessary. | ||
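The two formulas in the table can be illustrated with a small sketch. This is not the kernel's or gRPC's implementation. The `packet_sample` type, its field names, and the nanosecond timestamps are assumptions made only for this example.

```c
#include <stdint.h>

/* Hypothetical per-packet record; names and units are illustrative only. */
typedef struct {
  uint64_t sent_ns;     /* timestamp when the packet was sent */
  uint64_t acked_ns;    /* timestamp when the packet was acknowledged */
  uint64_t acked_bytes; /* bytes newly acknowledged by this ACK */
} packet_sample;

/* RTT = packet acked timestamp - packet sent timestamp, in seconds. */
double rtt_seconds(const packet_sample* p) {
  return (double)(p->acked_ns - p->sent_ns) / 1e9;
}

/* Delivery rate = packet acked bytes / elapsed time, reported in bit/s,
 * where elapsed time = packet acked timestamp - last packet acked timestamp.
 * Returns 0 when no time has elapsed, to avoid dividing by zero. */
double delivery_rate_bits_per_sec(const packet_sample* p,
                                  uint64_t last_acked_ns) {
  uint64_t elapsed_ns = p->acked_ns - last_acked_ns;
  if (elapsed_ns == 0) return 0.0;
  return (double)p->acked_bytes * 8.0 * 1e9 / (double)elapsed_ns;
}
```

For example, a 1500-byte packet acknowledged 1 ms after the previous ACK corresponds to a delivery rate of 12 Mbit/s under these definitions. `grpc.tcp.min_rtt` would then be the smallest RTT sample that TCP currently tracks, not any single sample.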
|
||
#### Metric Collection Design | ||
|
||
A high-level approach to collecting TCP metrics (on Linux) is as follows: | ||
1) **Enable Network Timestamps for Metric Calculation:** Enable the `SO_TIMESTAMPING` option in the kernel's TCP stack through the `setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val))` system call. This enables the kernel to capture packet timestamps during transmission. | ||
2) **Calculate Metrics from Timestamps:** The Linux kernel calculates TCP connection metrics based on the captured packet timestamps. These metrics can be retrieved using the `getsockopt(TCP_INFO)` system call. For example, the delivery_rate metric estimates the goodput (the rate of useful data transmitted) for the most recent group of outbound data packets within a single flow ([code](https://elixir.bootlin.com/linux/v5.11.1/source/net/ipv4/tcp.c#L391)). | ||
3) **Periodically Collect Statistics:** At a specified time interval (e.g., every 5 minutes), gRPC aggregates the calculated metrics and updates the corresponding statistics records. | ||
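The three steps above can be sketched in C on Linux as follows. This is a minimal illustration, not the Fathom or C-core implementation. The `SOF_TIMESTAMPING_*` flag set, the helper names, the `tcp_metrics_snapshot` type, and in particular the mapping from metric names to `struct tcp_info` fields are assumptions, since the proposal does not state them. The sketch also assumes kernel headers recent enough to define `tcpi_dsack_dups`.

```c
#include <linux/net_tstamp.h> /* SOF_TIMESTAMPING_* flags */
#include <linux/tcp.h>        /* struct tcp_info, TCP_INFO */
#include <netinet/in.h>       /* IPPROTO_TCP */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Step 1: ask the kernel for software transmit timestamps.
 * The exact flag set is an assumption, not taken from the proposal. */
int enable_tx_timestamps(int fd) {
  int val = SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_TX_ACK |
            SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID |
            SOF_TIMESTAMPING_OPT_TSONLY;
  return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
}

/* Hypothetical snapshot type in the units used by the metrics table. */
typedef struct {
  double min_rtt_s;           /* seconds */
  double delivery_rate_bps;   /* bit/s */
  uint64_t packets_sent;      /* cumulative */
  uint64_t packets_retransmitted;
  uint64_t packets_spurious_retransmitted;
} tcp_metrics_snapshot;

/* Step 2 (conversion): assumed field mapping. tcpi_min_rtt is reported in
 * microseconds and tcpi_delivery_rate in bytes per second. */
tcp_metrics_snapshot snapshot_from_tcp_info(const struct tcp_info* info) {
  tcp_metrics_snapshot s;
  s.min_rtt_s = (double)info->tcpi_min_rtt / 1e6;
  s.delivery_rate_bps = (double)info->tcpi_delivery_rate * 8.0;
  s.packets_sent = info->tcpi_data_segs_out;
  s.packets_retransmitted = info->tcpi_total_retrans;
  s.packets_spurious_retransmitted = info->tcpi_dsack_dups;
  return s;
}

/* Step 2 (retrieval): read TCP_INFO for a socket. Returns 0 on success. */
int read_tcp_metrics(int fd, tcp_metrics_snapshot* out) {
  struct tcp_info info;
  socklen_t len = sizeof(info);
  memset(&info, 0, sizeof(info)); /* older kernels may fill fewer bytes */
  if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) != 0) return -1;
  *out = snapshot_from_tcp_info(&info);
  return 0;
}

/* Step 3: the kernel counters are cumulative, so a per-period counter value
 * is the difference between two consecutive snapshots. */
uint64_t counter_delta(uint64_t current, uint64_t previous) {
  return current >= previous ? current - previous : 0;
}
```

A collector following this sketch would call `enable_tx_timestamps` once per connection, then call `read_tcp_metrics` at each interval, record `min_rtt_s` and `delivery_rate_bps` into the histograms, and add `counter_delta` of each packet count to the corresponding counter.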
That could be a bit more precise. I have 3 questions:
That's the only thing that really makes sense given the definition of the metrics above, so perhaps it's fine to leave it implicit.

I think in C++ the interval is not fixed, and the default value is 5 minutes in Fathom. For other language implementations, the interval can be adjusted as needed. @yousukseung for other questions of the Fathom implementation. Thanks. For context, this high-level plan aims to provide a general understanding of the existing metric collection process in C++ (implemented through Fathom), while offering flexibility for adaptation in other languages. To maintain clarity and focus, implementation details have been omitted from this proposal and can be found in the Fathom documentation.

"At a specified time interval" means the interval will be configurable. How is it configured? |
||
|
||
A detailed explanation of the design can be found in the Fathom documentation. | ||
|
||
#### Reference: | ||
* Fathom: https://dl.acm.org/doi/pdf/10.1145/3603269.3604815 | ||
* Kernel TCP Timestamping: https://www.kernel.org/doc/Documentation/networking/timestamping.rst | ||
* Delivery Rate: https://datatracker.ietf.org/doc/html/draft-cheng-iccrg-delivery-rate-estimation#name-delivery-rate | ||
|
||
### Metric Stability | ||
|
||
All metrics added in this proposal will start as experimental. The long term goal will be to | ||
de-experimentalize them and have them be on by default, but the exact | ||
criteria for that change are TBD. | ||
|
||
### Temporary environment variable protection | ||
|
||
This proposal does not include any features enabled via external I/O, so | ||
it does not need environment variable protection. | ||
|
||
## Implementation | ||
|
||
These metrics will be implemented in C-core. There are currently no plans to implement them in other languages. |
I'd still like to know what the mapping is for each metric. The first few are easy because the names here mirror tcp_info (I assume). But then it gets less obvious, and tcp_info isn't documented very well.