Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change kubeletEndpoint port when cloudstream conneted #3277

Merged
merged 3 commits into from
Nov 30, 2021

Conversation

chenchunxiu
Copy link
Member

@chenchunxiu chenchunxiu commented Nov 8, 2021

Signed-off-by: chenchunxiu chenchunxiu_yewu@cmss.chinamobile.com

What type of PR is this?
/kind bug

What this PR does / why we need it:
If cloudcore is multi-active (cloudcore1, cloudcore2, cloudcore3) and LB is used.
In bellowing case, kubectl exec/logs failed.

LB forward connection of cloudhub (from edge-node-01) to cloudcore1. --- cloudhub port 10000
LB forward connection of cloudstream (from edge-node-01) to cloudcore2. --- cloudstream port 10004

So, we should change kubeletEndpoint when cloudstrem conneted, not cloudhub connected.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


@kubeedge-bot kubeedge-bot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 8, 2021
@kubeedge-bot kubeedge-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 8, 2021
Signed-off-by: chenchunxiu <chenchunxiu_yewu@cmss.chinamobile.com>
@chenchunxiu
Copy link
Member Author

anybody has time to review? @fisherxu

@zc2638
Copy link
Member

zc2638 commented Nov 16, 2021

Thanks for fixing.
Can you add more descriptions of this patch?

Is it because the node has been created, and then LB needs to be added to the cloud, and the kubeletEndpoint needs to be changed at this time?

@chenchunxiu
Copy link
Member Author

Thanks for fixing. Can you add more descriptions of this patch?

Is it because the node has been created, and then LB needs to be added to the cloud, and the kubeletEndpoint needs to be changed at this time?

When cloudcore is multi-active and LB is used, there is no guarantee that LB will forward the edgehub connection and edgestream connection of the same node to the same cloudcore.
If LB forwards both edgehub and edgestream connections to the same cloudcore, kubectl logs/exec works normally
image

If LB forwards both edgehub and edgestream connections to different cloudcore, kubectl logs/exec will not work because the kubectl logs/exec request is forwarded to the cloudcore without stream connection.
image

Therefore, we should keep the setting of kubeletEndpointPort consistent with the cloudcore where cloudstream is connected.
image

@zc2638
Copy link
Member

zc2638 commented Nov 17, 2021

I have some doubts.

The original logic is to update the kubeletEndpoint when the node is created or the node status is updated.
Now, when the edge node connects to the cloud stream, first update the kubeletEndpoint.

What is the difference between them?
What is the difference between processing this logic in different places?

@kubeedge-bot kubeedge-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 17, 2021
@zc2638
Copy link
Member

zc2638 commented Nov 17, 2021

For edge nodes, only the address of LB.
When edgestream connects to the corresponding cloudstream, confirm the proxy port of the cloud.

e.g. maybe edgestream connects to cloudstream A, so we update kubeletEndpoint to 10351.
e.g. maybe edgestream connects to cloudstream B, so we update kubeletEndpoint to 10352.

This is my understanding of your explanation, is that the case?
@chenchunxiu

@kevin-wangzefeng
Copy link
Member

This is caused by inconsistent load-balancing on multiple requests from a some edge node.
To quick fix, you may enable features like "sticky session", or "route by source IP" provided by the LB.

@fisherxu
Copy link
Member

For edge nodes, only the address of LB. When edgestream connects to the corresponding cloudstream, confirm the proxy port of the cloud.

e.g. maybe edgestream connects to cloudstream A, so we update kubeletEndpoint to 10351. e.g. maybe edgestream connects to cloudstream B, so we update kubeletEndpoint to 10352.

This is my understanding of your explanation, is that the case? @chenchunxiu

CloudStream/EdgeStream use one connection, and the cloudhub/edgehub use another connection. The two connections may on two servers behide LB, and we use the cloud/edge hub to update the nodestatus, so CloudStream/EdgeStream will not work. @zc2638

Copy link
Member

@fisherxu fisherxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, please resolve the confict :) @chenchunxiu

@kubeedge-bot kubeedge-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 17, 2021
@kubeedge-bot kubeedge-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 17, 2021
@kubeedge-bot kubeedge-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 17, 2021
@chenchunxiu
Copy link
Member Author

Looks good, please resolve the confict :) @chenchunxiu

Done

getNode.Status.DaemonEndpoints.KubeletEndpoint.Port = int32(s.tunnelPort)
node, err := client.GetKubeClient().CoreV1().Nodes().UpdateStatus(context.Background(), getNode, metav1.UpdateOptions{})
if err != nil {
klog.Errorf("update node KubeletEndpoint Port failed with error: %s, node: %v, tunnelPort:%v",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@chenchunxiu
Copy link
Member Author

ping @fisherxu @zc2638

Comment on lines 191 to 193
if errors.IsNotFound(err) {
klog.Warningf("node %s not found", nodeName)
return false, nil
}
if err != nil {
klog.Errorf("Failed while getting a Node to retry updating node KubeletEndpoint Port. error: %v", nodeName, err)
return false, err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if errors.IsNotFound(err) {
klog.Warningf("node %s not found", nodeName)
return false, nil
}
if err != nil {
klog.Errorf("Failed while getting a Node to retry updating node KubeletEndpoint Port. error: %v", nodeName, err)
return false, err
}
if err != nil {
klog.Errorf("Failed while getting a Node to retry updating node KubeletEndpoint Port, node: %s, error: %v", nodeName, err)
return false, nil
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

getNode.Status.DaemonEndpoints.KubeletEndpoint.Port = int32(s.tunnelPort)
_, err = client.GetKubeClient().CoreV1().Nodes().UpdateStatus(context.Background(), getNode, metav1.UpdateOptions{})
if err != nil {
klog.Errorf("update node KubeletEndpoint Port failed with error: %s, node: %v, tunnelPort: %v",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Errorf("update node KubeletEndpoint Port failed with error: %s, node: %v, tunnelPort: %v",
klog.Errorf("Failed to update node KubeletEndpoint Port, node: %s, tunnelPort: %s, err: %v", getNode.Name, s.tunnelPort, err)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

klog.Errorf("Update KubeletEndpoint Port of Node '%v' error: %v. ", nodeName, err)
os.Exit(1)
}
klog.Infof("update node KubeletEndpoint Port success. node: %s, tunnelPort: %v", nodeName, s.tunnelPort)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Infof("update node KubeletEndpoint Port success. node: %s, tunnelPort: %v", nodeName, s.tunnelPort)
klog.V(2).Infof("Update node KubeletEndpoint Port successfully, node: %s, tunnelPort: %s", nodeName, s.tunnelPort)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V(4) maybe more fine, or large scale edge nodes will print lots of success logs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
return true, nil
}); err != nil {
klog.Errorf("Update KubeletEndpoint Port of Node '%v' error: %v. ", nodeName, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Errorf("Update KubeletEndpoint Port of Node '%v' error: %v. ", nodeName, err)
klog.Errorf("Failed to update node KubeletEndpoint Port, node: %s, tunnelPort: %s, err: %v", nodeName, s.tunnelPort, err)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"github.com/kubeedge/kubeedge/pkg/stream"
)

const (
// The amount of time the tunnelserver should sleep between retrying node status updates
retrySleepTime = 20 * time.Millisecond
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
retrySleepTime = 20 * time.Millisecond
retrySleepTime = 20 * time.Second

The retry time can be 20s here, if there is a delay in edge node creation, here need to wait more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -169,3 +184,30 @@ func (s *TunnelServer) Start() {
return
}
}

func (s *TunnelServer) updateNodeKubeletEndpoint(nodeName string) {
if err := wait.PollImmediate(retrySleepTime, nodeStatusUpdateRetry, func() (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err := wait.PollImmediate(retrySleepTime, nodeStatusUpdateRetry, func() (bool, error) {
if err := wait.Poll(retrySleepTime, nodeStatusUpdateRetry, func() (bool, error) {

Let's use wait.Poll here, PollImmediate will exit when err occur, here we need to retry.

Copy link
Member Author

@chenchunxiu chenchunxiu Nov 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reply.
I test in my machine, if err happens retry will execute.
wait.PollImmediate will run the func immediately, and wait.Poll will wait for the interval before run the func

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okey, sounds good, and so we need to return false,nil as I comment above. Or the retry will exit directly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reply, I have changed the return false, err to return false, nil.

type TunnelServer struct {
container *restful.Container
upgrader websocket.Upgrader
sync.Mutex
sessions map[string]*Session
nodeNameIP sync.Map
tunnelPort int
Copy link
Member

@gy95 gy95 Nov 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stupid question 😄 Is this tunnelPort from the configuration file and need to be configured by users themselves?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, users do not need to configure manually.
tunnelPort will be automatically generated when cloudcore is started at first time and saved in configmap. It will be read directly from configmap the next time cloudcore started.

Signed-off-by: chenchunxiu <chenchunxiu_yewu@cmss.chinamobile.com>
Copy link
Member

@fisherxu fisherxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
Sorry for the delay.

@kubeedge-bot kubeedge-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 30, 2021
@kubeedge-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fisherxu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeedge-bot kubeedge-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2021
@kubeedge-bot kubeedge-bot merged commit c50af6a into kubeedge:master Nov 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants