Skip to content

Conversation

@skwwwwww
Copy link
Contributor

What type of PR is this?

/kind feature

@codecov
Copy link

codecov bot commented Sep 23, 2024

Codecov Report

Attention: Patch coverage is 1.09290% with 181 lines in your changes missing coverage. Please review.

Project coverage is 48.45%. Comparing base (1902392) to head (efb1588).
Report is 75 commits behind head on main.

Files with missing lines Patch % Lines
pkg/controller/telemetry/map_metric.go 0.00% 94 Missing ⚠️
pkg/controller/telemetry/operation_metric.go 0.00% 78 Missing ⚠️
pkg/controller/workload/workload_controller.go 0.00% 7 Missing ⚠️
pkg/controller/controller.go 0.00% 2 Missing ⚠️
Files with missing lines Coverage Δ
pkg/controller/telemetry/utils.go 64.81% <100.00%> (+1.35%) ⬆️
pkg/controller/controller.go 0.00% <0.00%> (ø)
pkg/controller/workload/workload_controller.go 53.84% <0.00%> (-4.49%) ⬇️
pkg/controller/telemetry/operation_metric.go 0.00% <0.00%> (ø)
pkg/controller/telemetry/map_metric.go 0.00% <0.00%> (ø)

... and 24 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4ca182...efb1588. Read the comment docs.

Copy link
Contributor

@LiZhenCheng9527 LiZhenCheng9527 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. When to delete outdated map metrics.
  2. Refactor the struct2map function.
  3. The judgement of Error needs to specify the source of the error, if not necessary, you can not define the error first.

{
struct operation_usage_data data = {};
struct operation_usage_key key = {};
__u32 tid = bpf_get_current_pid_tgid() & 0xFFFFFFFF;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do & operation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low 32 bits returned by the bpf_get_current_pid_tgid() function represent the PID of the thread, the high 32 bits represent the TGID of the thread group

struct operation_usage_data {
__u64 start_time;
__u64 end_time;
__u32 operation_type;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is operation_type? And why it is in both key and value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operation_type refers to a function, such as sock_traffic_control.
sock_traffic_control

};

struct operation_usage_key {
__u32 tid;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this pid?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can we map it to a pod

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tid is the unique identifier of a thread obtained through bpf_get_current_pid_tgid().

A unique kmesh-daemon runs on each node, and kmesh collects data from different nodes, allowing the data to be associated with the respective pods.

continue
}
memLock := (info.KeySize + info.ValueSize) * info.MaxEntries
if memLock%4096 != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use macro

}
startID = mapID
count++
mapData.mapId = uint32(mapID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to split this function to ensure logical independence

Signed-off-by: skw <2438567342@qq.com>
@skwwwwww
Copy link
Contributor Author

@hzxuzhonghu
The code has been improved and conflicts have been resolved

@skwwwwww skwwwwww requested a review from YaoZengzeng October 27, 2024 14:01
Copy link
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure how this will influence the data plane performance.

mapMetricCache map[mapMetricLabels]*mapUsageInfo
}

type mapUsageMetric struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please find a more proper name

__type(key, struct operation_usage_key);
__type(value, struct operation_usage_data);
__uint(max_entries, 1024);
} performance_data_map SEC(".maps");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the map name with kmesh prefix

struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, RINGBUF_SIZE);
} map_perf_info SEC(".maps");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not remove performance_data_map map and make it inlined into the ringbuffer?

keySize uint32
valueSize uint32
memLock uint64
entryCount uint32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we export those attributes like bpftool -j?

func (m *OperationMetricController) buildOperationMetric(data *operationUsageMetric) operationMetricLabels {
labels := operationMetricLabels{}
// Get the actual pod name
podName := os.Getenv("HOSTNAME")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the pod name

key.tid = tid;
data.start_time = bpf_ktime_get_ns();
data.operation_type = operation_type;
bpf_map_update_elem(&kmesh_perf_map, &key, &data, BPF_ANY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not support concurrent connect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use pid_tgid as part of the key to solve the concurrency issue.

};

struct operation_usage_key {
__u32 tid;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pid, right?

info->start_time = data->start_time;
info->end_time = data->end_time;
info->operation_type = data->operation_type;
bpf_ringbuf_submit(info, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you only report thes info to userspace, donot we care about which pid is operating on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pid_tgid has been added to the labels.

nodeName = "unknown-node-name"
}
labels.nodeName = nodeName
labels.operationType = fmt.Sprintf("%d", data.operationType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use string instead of int, this is exposed to user

__uint(type, BPF_MAP_TYPE_HASH);
__type(key, struct operation_usage_key);
__type(value, struct operation_usage_data);
__uint(max_entries, 1024);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is too small

)

const (
mapMetricFlushInterval = 5 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it align with previous metrics, 15s iirc

labels := operationMetricLabels{}
nodeName := os.Getenv("NODE_NAME")
if nodeName == "" {
nodeName = "unknown-node-name"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
nodeName = "unknown-node-name"
nodeName = "unknown"

}
labels.nodeName = nodeName
labels.operationType = operationTypeMap[data.operationType]
labels.pidTgid = fmt.Sprintf("%d", data.pidTgid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pure pid does not make sense, ideally we should have correlate with pod name/namespace

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So i suggest first leave it unset

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ability to retrieve pidTgid is retained, but it is not added to the labels.

struct operation_usage_data {
__u64 start_time;
__u64 end_time;
__u64 pid_tgid;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we donot need to have this in both key and value

Copy link
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, we should have a feature flag to control whether to enable it

type MapMetricController struct {
}

type mapEntrycountMetric struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: MapInfo

}

type totalMapMetricLabels struct {
nodeName string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove this sturct, so similar with mapMetricLabels

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @skwwwwww

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering code reuse and style before, I think there is no need to delete it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt isn't this duplicate with mapMetricLabels

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One is a label for a single map, and the other is a label for the total number of maps, requiring different struct to label different data

if err != nil {
break
}
defer mapInfo.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not good practice to call defer within a loop

defer mapInfo.Close()

if info.Name == "" {
startID = mapID
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be always set

Signed-off-by: skw <2438567342@qq.com>
hzxuzhonghu
hzxuzhonghu previously approved these changes Oct 31, 2024
Signed-off-by: skw <2438567342@qq.com>
Copy link
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@kmesh-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kmesh-bot kmesh-bot merged commit 4b403b2 into kmesh-net:main Oct 31, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants