Add Performance Monitoring #887
LiZhenCheng9527
left a comment
- When should outdated map metrics be deleted?
- Refactor the struct2map function.
- Error checks need to identify the source of the error. If an error variable is not needed, do not define it up front.
bpf/kmesh/probes/performance_probe.h
Outdated
| { | ||
| struct operation_usage_data data = {}; | ||
| struct operation_usage_key key = {}; | ||
| __u32 tid = bpf_get_current_pid_tgid() & 0xFFFFFFFF; |
Why is the & operation needed here?
The low 32 bits returned by bpf_get_current_pid_tgid() hold the PID of the thread (the kernel's per-thread ID), and the high 32 bits hold the TGID of the thread group. The & 0xFFFFFFFF mask keeps only the low 32 bits.
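To make the bit layout concrete, here is a minimal user-space sketch in Go of how the 64-bit value can be split. The helper name splitPidTgid and the sample numbers are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// splitPidTgid splits the 64-bit value returned by bpf_get_current_pid_tgid()
// into its two halves. The helper name is hypothetical.
func splitPidTgid(pidTgid uint64) (tgid, tid uint32) {
	tgid = uint32(pidTgid >> 32)       // high 32 bits: thread group ID
	tid = uint32(pidTgid & 0xFFFFFFFF) // low 32 bits: per-thread ID
	return tgid, tid
}

func main() {
	// Hypothetical sample: thread group 1234, thread 5678.
	pidTgid := uint64(1234)<<32 | uint64(5678)
	tgid, tid := splitPidTgid(pidTgid)
	fmt.Println("tgid:", tgid, "tid:", tid)
}
```

The mask in the probe corresponds to the second assignment; the shift recovers the half that the mask discards.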
| struct operation_usage_data { | ||
| __u64 start_time; | ||
| __u64 end_time; | ||
| __u32 operation_type; |
What is operation_type, and why is it in both the key and the value?
| }; | ||
|
|
||
| struct operation_usage_key { | ||
| __u32 tid; |
Is this the PID?
How can we map it to a pod?
The tid is the unique identifier of a thread obtained through bpf_get_current_pid_tgid().
One kmesh-daemon runs on each node, and Kmesh collects data from the different nodes, which allows the data to be associated with the respective pods.
| continue | ||
| } | ||
| memLock := (info.KeySize + info.ValueSize) * info.MaxEntries | ||
| if memLock%4096 != 0 { |
Better to use a macro (a named constant) rather than the literal 4096.
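One way to address the comment about the literal 4096 is sketched below. It assumes the branch that is not shown rounds the estimate up to a whole page, and that key size, value size, and max entries are uint32 values as in the struct fields shown in this review. The names pageSize, roundUpToPage, and estimateMemLock are hypothetical.

```go
package main

import "fmt"

// pageSize is an assumed 4 KiB page. os.Getpagesize() could be used instead
// if the value should come from the running system.
const pageSize = 4096

// roundUpToPage rounds n up to the next multiple of pageSize.
func roundUpToPage(n uint64) uint64 {
	if rem := n % pageSize; rem != 0 {
		n += pageSize - rem
	}
	return n
}

// estimateMemLock converts to uint64 before multiplying so that large maps
// do not overflow 32-bit arithmetic. It is only a rough estimate of map memory.
func estimateMemLock(keySize, valueSize, maxEntries uint32) uint64 {
	raw := (uint64(keySize) + uint64(valueSize)) * uint64(maxEntries)
	return roundUpToPage(raw)
}

func main() {
	fmt.Println(estimateMemLock(4, 4, 100))
	fmt.Println(estimateMemLock(8, 24, 1024))
}
```

Keeping the page size in one named constant also makes it easy to swap in a runtime lookup later.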
| } | ||
| startID = mapID | ||
| count++ | ||
| mapData.mapId = uint32(mapID) |
Better to split this function so that each part is logically independent.
Signed-off-by: skw <2438567342@qq.com>
@hzxuzhonghu
hzxuzhonghu
left a comment
I am not quite sure how this will affect data plane performance.
| mapMetricCache map[mapMetricLabels]*mapUsageInfo | ||
| } | ||
|
|
||
| type mapUsageMetric struct { |
Please find a more appropriate name.
bpf/kmesh/probes/performance_probe.h
Outdated
| __type(key, struct operation_usage_key); | ||
| __type(value, struct operation_usage_data); | ||
| __uint(max_entries, 1024); | ||
| } performance_data_map SEC(".maps"); |
Please rename the map so that it has a kmesh prefix.
bpf/kmesh/probes/performance_probe.h
Outdated
| struct { | ||
| __uint(type, BPF_MAP_TYPE_RINGBUF); | ||
| __uint(max_entries, RINGBUF_SIZE); | ||
| } map_perf_info SEC(".maps"); |
Why not remove the performance_data_map map and inline its data into the ring buffer?
| keySize uint32 | ||
| valueSize uint32 | ||
| memLock uint64 | ||
| entryCount uint32 |
Should we export these attributes the way bpftool -j does?
| func (m *OperationMetricController) buildOperationMetric(data *operationUsageMetric) operationMetricLabels { | ||
| labels := operationMetricLabels{} | ||
| // Get the actual pod name | ||
| podName := os.Getenv("HOSTNAME") |
This is not the pod name
| key.tid = tid; | ||
| data.start_time = bpf_ktime_get_ns(); | ||
| data.operation_type = operation_type; | ||
| bpf_map_update_elem(&kmesh_perf_map, &key, &data, BPF_ANY); |
This does not support concurrent connect operations.
Use pid_tgid as part of the key to solve the concurrency issue.
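The effect of widening the key can be modeled in user space. The sketch below is a Go analogy, not the BPF code from the PR: it assumes the key combines the full pid_tgid with the operation type, and the names opKey and tracker are hypothetical. Two threads of the same process each keep their own in-flight entry, so one start record does not overwrite the other.

```go
package main

import "fmt"

// opKey is a hypothetical composite key: the full 64-bit pid_tgid plus the
// operation type, so entries from different threads never collide.
type opKey struct {
	pidTgid       uint64
	operationType uint32
}

// tracker models the hash map that stores start timestamps (in ns).
type tracker struct {
	inflight map[opKey]uint64
}

func newTracker() *tracker {
	return &tracker{inflight: make(map[opKey]uint64)}
}

// start records when an operation began.
func (t *tracker) start(k opKey, now uint64) {
	t.inflight[k] = now
}

// end removes the entry and returns the elapsed time, if a start was seen.
func (t *tracker) end(k opKey, now uint64) (uint64, bool) {
	began, ok := t.inflight[k]
	if !ok {
		return 0, false
	}
	delete(t.inflight, k)
	return now - began, true
}

func main() {
	tr := newTracker()
	a := opKey{pidTgid: 1<<32 | 10, operationType: 1} // thread 10 of process 1
	b := opKey{pidTgid: 1<<32 | 11, operationType: 1} // thread 11 of process 1
	tr.start(a, 100)
	tr.start(b, 150)
	da, _ := tr.end(a, 400)
	db, _ := tr.end(b, 450)
	fmt.Println(da, db)
}
```

Deleting the entry in end also keeps the map from filling up with finished operations.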
bpf/kmesh/probes/performance_probe.h
Outdated
| }; | ||
|
|
||
| struct operation_usage_key { | ||
| __u32 tid; |
pid, right?
| info->start_time = data->start_time; | ||
| info->end_time = data->end_time; | ||
| info->operation_type = data->operation_type; | ||
| bpf_ringbuf_submit(info, 0); |
You only report this info to user space. Don't we care which PID is performing the operation?
The pid_tgid has been added to the labels.
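For the user-space side, a record carrying pid_tgid could be decoded roughly as follows. This is a sketch under stated assumptions: the field order (start_time, end_time, pid_tgid, operation_type), the little-endian byte order, and the names perfEvent and decodePerfEvent are all assumptions and are not confirmed by the PR.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"time"
)

// perfEvent mirrors an ASSUMED record layout: three u64 fields followed by a
// u32. A C struct with this layout would normally be padded to 32 bytes.
type perfEvent struct {
	startTime     uint64
	endTime       uint64
	pidTgid       uint64
	operationType uint32
}

// perfEventMinSize is the number of bytes actually read (3*8 + 4).
const perfEventMinSize = 28

// decodePerfEvent parses one raw record, assuming little-endian byte order.
func decodePerfEvent(raw []byte) (perfEvent, error) {
	if len(raw) < perfEventMinSize {
		return perfEvent{}, errors.New("perf event record too short")
	}
	return perfEvent{
		startTime:     binary.LittleEndian.Uint64(raw[0:8]),
		endTime:       binary.LittleEndian.Uint64(raw[8:16]),
		pidTgid:       binary.LittleEndian.Uint64(raw[16:24]),
		operationType: binary.LittleEndian.Uint32(raw[24:28]),
	}, nil
}

// duration returns the elapsed time, or zero if the timestamps are inverted.
func (e perfEvent) duration() time.Duration {
	if e.endTime < e.startTime {
		return 0
	}
	return time.Duration(e.endTime - e.startTime)
}

func main() {
	raw := make([]byte, 32)
	binary.LittleEndian.PutUint64(raw[0:8], 1000)
	binary.LittleEndian.PutUint64(raw[8:16], 4500)
	binary.LittleEndian.PutUint64(raw[16:24], uint64(1234)<<32|5678)
	binary.LittleEndian.PutUint32(raw[24:28], 1)

	ev, err := decodePerfEvent(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(ev.duration(), ev.pidTgid>>32, uint32(ev.pidTgid), ev.operationType)
}
```

Decoding the fields explicitly keeps the length check and the byte-order assumption visible in one place.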
| nodeName = "unknown-node-name" | ||
| } | ||
| labels.nodeName = nodeName | ||
| labels.operationType = fmt.Sprintf("%d", data.operationType) |
Use a string instead of an int; this is exposed to users.
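A lookup table is one simple way to turn the numeric type into a readable label. In the sketch below the operation names and numeric values are placeholders, since the real operation types are not listed in this thread, and operationTypeLabel is a hypothetical helper that adds a fallback for unmapped values.

```go
package main

import "fmt"

// operationTypeNames maps numeric operation types to label values.
// The entries are placeholders; the real types are not shown in this thread.
var operationTypeNames = map[uint32]string{
	1: "example_op_a",
	2: "example_op_b",
}

// operationTypeLabel returns a readable label and falls back to "unknown"
// so that an unmapped value never produces an empty label.
func operationTypeLabel(t uint32) string {
	if name, ok := operationTypeNames[t]; ok {
		return name
	}
	return "unknown"
}

func main() {
	fmt.Println(operationTypeLabel(1))
	fmt.Println(operationTypeLabel(99))
}
```

Indexing the map directly would return an empty string for unmapped values, which is why the helper checks the second return value.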
bpf/kmesh/probes/performance_probe.h
Outdated
| __uint(type, BPF_MAP_TYPE_HASH); | ||
| __type(key, struct operation_usage_key); | ||
| __type(value, struct operation_usage_data); | ||
| __uint(max_entries, 1024); |
This value is too small
| ) | ||
|
|
||
| const ( | ||
| mapMetricFlushInterval = 5 * time.Second |
Make it align with the previous metrics, 15s IIRC.
| labels := operationMetricLabels{} | ||
| nodeName := os.Getenv("NODE_NAME") | ||
| if nodeName == "" { | ||
| nodeName = "unknown-node-name" |
| nodeName = "unknown-node-name" | |
| nodeName = "unknown" |
| } | ||
| labels.nodeName = nodeName | ||
| labels.operationType = operationTypeMap[data.operationType] | ||
| labels.pidTgid = fmt.Sprintf("%d", data.pidTgid) |
A bare PID does not make sense; ideally we should correlate it with the pod name/namespace.
So I suggest leaving it unset for now.
The ability to retrieve pidTgid is retained, but it is not added to the labels.
| struct operation_usage_data { | ||
| __u64 start_time; | ||
| __u64 end_time; | ||
| __u64 pid_tgid; |
nit: we do not need to have this in both the key and the value.
hzxuzhonghu
left a comment
BTW, we should have a feature flag to control whether this is enabled.
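A feature flag of this kind could look roughly like the sketch below. It uses the standard library flag package purely for illustration; the flag name enable-perf-monitor and the options struct are hypothetical, and the project may wire its command-line options differently.

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the hypothetical switch for performance monitoring.
type options struct {
	enablePerfMonitor bool
}

// parseOptions parses the given arguments; monitoring is off by default so
// the data plane pays no cost unless it is explicitly enabled.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("daemon", flag.ContinueOnError)
	fs.BoolVar(&o.enablePerfMonitor, "enable-perf-monitor", false,
		"enable performance monitoring probes and metrics")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return o, nil
}

func main() {
	opts, err := parseOptions([]string{"--enable-perf-monitor=true"})
	if err != nil {
		fmt.Println("invalid arguments:", err)
		return
	}
	if opts.enablePerfMonitor {
		fmt.Println("performance monitoring enabled")
	} else {
		fmt.Println("performance monitoring disabled")
	}
}
```

Defaulting to false means the probes and the metric controllers are only started when an operator opts in.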
| type MapMetricController struct { | ||
| } | ||
|
|
||
| type mapEntrycountMetric struct { |
nit: MapInfo
| } | ||
|
|
||
| type totalMapMetricLabels struct { | ||
| nodeName string |
Can you remove this struct? It is very similar to mapMetricLabels.
ping @skwwwwww
Considering code reuse and the existing style, I think there is no need to delete it.
Isn't this a duplicate of mapMetricLabels?
One is the label set for a single map, and the other is the label set for the total number of maps, so different structs are needed to label the different data.
| if err != nil { | ||
| break | ||
| } | ||
| defer mapInfo.Close() |
It is not good practice to call defer within a loop
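Deferred calls run only when the enclosing function returns, so a defer inside a loop keeps every handle open until the whole loop has finished. A common fix is to give each iteration its own function scope. The sketch below illustrates this with a stand-in resource type; resource, openResource, and processAll are hypothetical names, not code from the PR.

```go
package main

import "fmt"

// resource is a stand-in for a handle that must be closed, such as a map
// handle. It records its lifecycle in a shared log for demonstration.
type resource struct {
	id  int
	log *[]string
}

func openResource(id int, log *[]string) *resource {
	*log = append(*log, fmt.Sprintf("open %d", id))
	return &resource{id: id, log: log}
}

func (r *resource) Close() {
	*r.log = append(*r.log, fmt.Sprintf("close %d", r.id))
}

// processAll wraps the loop body in a function literal, so each deferred
// Close runs at the end of its own iteration instead of piling up.
func processAll(ids []int) []string {
	var log []string
	for _, id := range ids {
		func() {
			r := openResource(id, &log)
			defer r.Close()
			log = append(log, fmt.Sprintf("use %d", id))
		}()
	}
	return log
}

func main() {
	for _, entry := range processAll([]int{1, 2}) {
		fmt.Println(entry)
	}
}
```

An explicit Close call at the end of each iteration works as well, as long as every early exit path such as continue or break also closes the handle.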
| defer mapInfo.Close() | ||
|
|
||
| if info.Name == "" { | ||
| startID = mapID |
Can this always be set?
Signed-off-by: skw <2438567342@qq.com>
hzxuzhonghu
left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hzxuzhonghu
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |

What type of PR is this?
/kind feature