fix: lsp may be lost when server pressure is high
When the server is under high pressure, etcd and ovn-db change leaders frequently. Sometimes a pod has the allocation-ready annotation but its lsp is lost.

Some possible reasons are:
1. When ovn-db changes its leader, the lsp data has not yet been propagated to all instances, so the lsp is not persisted even though we have already written back the annotation.

2. The lsp has been written to all ovn-db instances, but etcd changes its leader and the apiserver takes a long time to process the annotation write request. During this period the gc process is triggered and recycles the lsp before the annotation write succeeds.

To resolve it:
1. Add --wait=sb to ovn-nbctl, in the hope that more data is persisted before a leader change, to mitigate cause 1.

2. Extend the gc interval to mitigate cause 2.

3. Log it when it happens again.
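
To illustrate why a longer gc interval helps with cause 2, here is a minimal sketch of a two-pass mark-and-clean collector. It assumes, based on the `lastNoPodLSP = noPodLSP` assignment in `markAndCleanLSP`, that a port is only recycled after it has been seen without a pod in two consecutive passes. The `sweeper` type and its `pass` method are hypothetical names for illustration, not kube-ovn code.

```go
package main

import (
	"fmt"
	"sort"
)

// sweeper is a hypothetical two-pass collector: a port is deleted only if
// it had no owning pod in the previous pass and still has none in this one.
type sweeper struct {
	lastOrphans map[string]bool // ports that were orphaned in the previous pass
}

// pass takes the ports currently stored in the database and the ports that
// pods claim through their annotations, and returns the ports to delete.
func (s *sweeper) pass(ports, claimed []string) []string {
	owned := make(map[string]bool, len(claimed))
	for _, p := range claimed {
		owned[p] = true
	}

	orphans := make(map[string]bool)
	var doomed []string
	for _, p := range ports {
		if owned[p] {
			continue
		}
		orphans[p] = true
		// Orphaned in two consecutive passes: treat it as garbage.
		if s.lastOrphans[p] {
			doomed = append(doomed, p)
		}
	}
	s.lastOrphans = orphans
	sort.Strings(doomed)
	return doomed
}

func main() {
	s := &sweeper{}
	// Pass 1: the port for pod-b exists, but its annotation has not landed yet.
	fmt.Println(s.pass([]string{"pod-a", "pod-b"}, []string{"pod-a"}))
	// Pass 2: if the annotation is still pending, pod-b's port is recycled.
	fmt.Println(s.pass([]string{"pod-a", "pod-b"}, []string{"pod-a"}))
}
```

Under this assumption the time between two passes is the grace period an annotation write has to complete, so raising the interval from 30 seconds to 6 minutes gives a slow apiserver far more room before a freshly created port is recycled.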
oilbeater committed Jun 18, 2021
1 parent cfabf16 commit 8ed91be
Showing 3 changed files with 12 additions and 3 deletions.
2 changes: 1 addition & 1 deletion pkg/controller/controller.go
@@ -497,7 +497,7 @@ func (c *Controller) startWorkers(stopCh <-chan struct{}) {
if err := c.markAndCleanLSP(); err != nil {
klog.Errorf("gc lsp error %v", err)
}
-}, 30*time.Second, stopCh)
+}, 6*time.Minute, stopCh)

go wait.Until(func() {
c.syncExternalVpc()
8 changes: 7 additions & 1 deletion pkg/controller/gc.go
@@ -244,8 +244,14 @@ func (c *Controller) markAndCleanLSP() error {
}
}
}

lastNoPodLSP = noPodLSP

+for _, ipName := range ipNames {
+if !util.IsStringIn(ipName, lsps) {
+klog.Errorf("lsp lost for pod %s, please delete the pod and retry", ipName)
+}
+}

return nil
}
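
The lost-lsp check added above can be sketched in isolation as follows. This is an illustrative variant, not the committed code: `missingPorts` is a hypothetical helper that builds a set from the existing port names so each lookup is constant time, whereas the commit calls `util.IsStringIn` inside the loop.

```go
package main

import "fmt"

// missingPorts returns every expected port name that is absent from existing,
// preserving the order of expected.
func missingPorts(expected, existing []string) []string {
	have := make(map[string]struct{}, len(existing))
	for _, name := range existing {
		have[name] = struct{}{}
	}

	var missing []string
	for _, name := range expected {
		if _, ok := have[name]; !ok {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	ipNames := []string{"pod-a.default", "pod-b.default"} // ports that pods expect
	lsps := []string{"pod-a.default"}                     // ports actually present

	for _, name := range missingPorts(ipNames, lsps) {
		fmt.Printf("lsp lost for pod %s, please delete the pod and retry\n", name)
	}
}
```

With many pods the set-based lookup avoids rescanning the full port list for every name; for small clusters the simpler loop in the commit behaves the same.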

5 changes: 4 additions & 1 deletion pkg/ovs/ovn-nbctl.go
@@ -19,7 +19,7 @@ import (

func (c Client) ovnNbCommand(cmdArgs ...string) (string, error) {
start := time.Now()
-cmdArgs = append([]string{fmt.Sprintf("--timeout=%d", c.OvnTimeout)}, cmdArgs...)
+cmdArgs = append([]string{fmt.Sprintf("--timeout=%d", c.OvnTimeout), "--wait=sb"}, cmdArgs...)
raw, err := exec.Command(OvnNbCtl, cmdArgs...).CombinedOutput()
elapsed := float64((time.Since(start)) / time.Millisecond)
klog.V(4).Infof("command %s %s in %vms, output %q", OvnNbCtl, strings.Join(cmdArgs, " "), elapsed, raw)
@@ -142,6 +142,9 @@ func (c Client) CreatePort(ls, port, ip, cidr, mac, tag, pod, namespace string,
if pod != "" && namespace != "" {
ovnCommand = append(ovnCommand,
"--", "set", "logical_switch_port", port, fmt.Sprintf("external_ids:pod=%s/%s", namespace, pod), fmt.Sprintf("external_ids:vendor=%s", util.CniTypeName))
+} else {
+ovnCommand = append(ovnCommand,
+"--", "set", "logical_switch_port", port, fmt.Sprintf("external_ids:vendor=%s", util.CniTypeName))
}

if _, err := c.ovnNbCommand(ovnCommand...); err != nil {
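
As a rough illustration of the `ovnNbCommand` change, the sketch below shows how the global options end up in front of every subcommand. `withGlobalOptions` is a hypothetical helper and the example arguments are made up; only `--timeout` and `--wait=sb` come from the diff. With `--wait=sb`, ovn-nbctl waits for ovn-northd to bring the southbound database up to date before it returns.

```go
package main

import (
	"fmt"
	"strings"
)

// withGlobalOptions prepends the global ovn-nbctl options to a subcommand.
// The caller's slice is not modified.
func withGlobalOptions(timeoutSeconds int, cmdArgs ...string) []string {
	global := []string{
		fmt.Sprintf("--timeout=%d", timeoutSeconds),
		"--wait=sb", // block until the southbound database has caught up
	}
	return append(global, cmdArgs...)
}

func main() {
	// Example values only; the real timeout comes from c.OvnTimeout.
	args := withGlobalOptions(30, "lsp-add", "ls1", "pod-a.default")
	fmt.Println("ovn-nbctl " + strings.Join(args, " "))
}
```

Because the option is applied inside the shared wrapper, every ovn-nbctl call pays the extra wait, not just port creation; that is the trade-off for reducing the window described in cause 1.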