Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vpp bridge created with serious delay for vlan connection #781

Closed
zolug opened this issue Dec 15, 2022 · 10 comments
Closed

vpp bridge created with serious delay for vlan connection #781

zolug opened this issue Dec 15, 2022 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@zolug
Copy link
Contributor

zolug commented Dec 15, 2022

This issue is related to the following PR: networkservicemesh/sdk-vpp#634

When an NSM connection is set up to a service provided by Remote VLAN NSE, the interface injected into the NSC POD is not usable for minutes. According to vppcli the bridge wasn't available (present) and showed up with the same delay.

What seemingly fixes the "connection"/bridge is the first refresh of the NSM connection.
Also, I rebuilt forwarder-vpp on the version where above changed got included, and run that forwarder on NSM v1.6 with the same results.

Note: When testing preliminary drop of the feature such delay was not visible, unfortunately I have no information of its code base (image = registry.nordix.org/cloud-native/nsm/cmd-forwarder-vpp:with-bd-wout-mac).

Reproduction (on Kind):

  • Install NSM basic (v1.7-rc1). Or rebuild forwarder-vpp on commit 5ed4cbb using it with NSM v1.6 components.
  • Deploy a Remote VLAN NSE and NSC.

cat <<EOF | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nse-vlan
  labels:
    app: nse-vlan
spec:
  selector:
    matchLabels:
      app: nse-vlan
  template:
    metadata:
      labels:
        app: nse-vlan
        "spiffe.io/spiffe-id": "true"
    spec:
      containers:
        - name: nse
          image: registry.nordix.org/cloud-native/nsm/cmd-nse-remote-vlan:v1.6.0
          imagePullPolicy: Always
          env:
            - name: SPIFFE_ENDPOINT_SOCKET
              value: unix:///run/spire/sockets/agent.sock
            - name: NSM_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NSM_LOG_LEVEL
              value: TRACE
            - name: NSM_CONNECT_TO
              value: "nsm-registry-svc.nsm:5002"
            - name: NSM_CIDR_PREFIX
              value: "169.254.100.0/24,100:100::/64"
            - name: NSM_SERVICES
              value: "vlan-svc { vlan: 100; via: eth0 }"
            - name: NSM_LISTEN_ON
              value: "tcp://:5003"
            - name: NSM_POINT2POINT
              value: "False"
            - name: NSM_REGISTER_SERVICE
              value: "True"
          volumeMounts:
            - name: spire-agent-socket
              mountPath: /run/spire/sockets
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 40Mi
            limits:
              memory: 80Mi
              cpu: 200m
      nodeSelector:
        kubernetes.io/hostname: kind-worker2
      volumes:
        - name: spire-agent-socket
          hostPath:
            path: /run/spire/sockets
            type: Directory
EOF

cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: nsc
  labels:
    app: nsc  
    "spiffe.io/spiffe-id": "true"
spec:
  containers:
  - name: nsc
    image: registry.nordix.org/cloud-native/nsm/cmd-nsc:v1.6.0
    imagePullPolicy: Always
    env:
      - name: NSM_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: SPIFFE_ENDPOINT_SOCKET
        value: unix:///run/spire/sockets/agent.sock
      - name: NSM_LOG_LEVEL
        value: TRACE
      - name: NSM_NETWORK_SERVICES
        value: kernel://vlan-svc/ext-1
      - name: NSM_LIVENESSCHECKENABLED
        value: "false"
      - name: NSM_LOCALDNSSERVERENABLED
        value: "false"
      - name: NSM_DIAL_TIMEOUT
        value: "30s"
      - name: NSM_REQUEST_TIMEOUT
        value: "300s"
    volumeMounts:
      - name: spire-agent-socket
        mountPath: /run/spire/sockets
        readOnly: true
      - name: nsm-socket
        mountPath: /var/lib/networkservicemesh
        readOnly: true
    resources:
      requests:
        cpu: 100m
        memory: 40Mi
      limits:
        memory: 80Mi
        cpu: 200m
  volumes:
    - name: spire-agent-socket
      hostPath:
        path: /run/spire/sockets
        type: Directory
    - name: nsm-socket
      hostPath:
        path: /var/lib/networkservicemesh
        type: DirectoryOrCreate
  nodeSelector:
    kubernetes.io/hostname: kind-worker
EOF
  

  • Check vppcli on the forwarder hosting the NSC to see the delayed appearance of the bridge (while the NSC connection is already established):
root@kind-worker:/# vppctl show bridge-domain
no bridge-domains in use

Logs:
forwarder-vpp-ddqqp.5ed4cbb45dd25ee.log collocated with the NSC
First Request for service vlan-svc at Dec 15 14:51:22.210.
Refreshed at Dec 15 14:53:22.291 after which the bridge appeared in vpp.

root@kind-worker:/# date; vppctl show bridge-domain
Thu Dec 15 14:53:28 UTC 2022
  BD-ID   Index   BSN  Age(min)  Learning  U-Forwrd   UU-Flood   Flooding  ARP-Term  arp-ufwd Learn-co Learn-li   BVI-Intf 
 7271263    1      0     off        on        on       flood        on       off       off        0    16777216     N/A    
root@kind-worker:/# date; vppctl show bridge-domain 7271263 details
Thu Dec 15 14:53:37 UTC 2022
  BD-ID   Index   BSN  Age(min)  Learning  U-Forwrd   UU-Flood   Flooding  ARP-Term  arp-ufwd Learn-co Learn-li   BVI-Intf 
 7271263    1      0     off        on        on       flood        on       off       off        0    16777216     N/A    
             SPAN (span-l2-input)
   INPUT_CLASSIFY (l2-input-classify)
   INPUT_FEAT_ARC (l2-input-feat-arc)
     POLICER_CLAS (l2-policer-classify)
              ACL (l2-input-acl)
            VPATH (vpath-input-l2)
 L2_IP_QOS_RECORD (l2-ip-qos-record)
              VTR (l2-input-vtr)
            LEARN (l2-learn)
               RW (l2-rw)
              FWD (l2-fwd)
         UU_FLOOD (l2-flood)
            FLOOD (l2-flood)
         XCONNECT (l2-output)

           Interface           If-idx ISN  SHG  BVI  TxFlood        VLAN-Tag-Rewrite       
             tap0                3     1    1    -      *                 none             
         host-eth0.100           2     1    0    -      *                 pop-1   
@zolug zolug added the bug Something isn't working label Dec 15, 2022
@zolug
Copy link
Contributor Author

zolug commented Dec 15, 2022

@denis-tingaikin
Copy link
Member

@edwarnicke Is this a release blocker?

@zolug
Copy link
Contributor Author

zolug commented Dec 16, 2022

According to the logs during the initial connection request the "official" forwarder-vpp version skips the bridge setup part after the tap interface is configured:

5ed4cbb:

Dec 15 14:51:22.276ESC[37m [DEBU] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [HostIfName:ext-1] [HostNamespace:/proc/1/fd/19] [TapFlags:TapFlags(0)] [duration:26.09085ms] [swIfIndex:3] [type:networkServic
e] [vppapi:TapCreateV2] ESC[0m(29.2)                               completed
Dec 15 14:51:22.276ESC[37m [DEBU] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [duration:115.967µs] [mode:RX_MODE_API_ADAPTIVE] [swIfIndex:3] [type:networkService] [vppapi:SwInterfaceSetRxMode] ESC[0m(29.3)
                               completed
Dec 15 14:51:22.276ESC[37m [DEBU] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [duration:57.061µs] [link.Name:ext-1] [netlink:LinkByName] [type:networkService] ESC[0m(29.4)                               com
pleted
Dec 15 14:51:22.276ESC[37m [DEBU] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [alias:server-2f9a1e8a-2c6b-494f-b0c8-aaf80ce91200] [duration:27.377µs] [link.Name:ext-1] [netlink:LinkSetAlias] [type:networkS
ervice] ESC[0m(29.5)                               completed
Dec 15 14:51:22.276ESC[37m [DEBU] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [duration:7.486µs] [link.Name:ext-1] [netlink:LinkSetUp] [type:networkService] ESC[0m(29.6)                               compl
eted
Dec 15 14:51:22.277ESC[37m [DEBU] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [duration:55.379µs] [swIfIndex:3] [tag:3621c89d-8eb6-4a58-b15d-5556703fb51f] [type:networkService] [vppapi:SwInterfaceTagAddDel
] ESC[0m(26.1)                            completed
Dec 15 14:51:22.277ESC[37m [DEBU] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [Addr:169.254.100.1/24] [duration:36.657µs] [link.Name:ext-1] [netlink:AddrAdd] [type:networkService] ESC[0m(24.1)                          completed
Dec 15 14:51:22.277ESC[37m [DEBU] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [Addr:100:100::1/64] [duration:35.531µs] [link.Name:ext-1] [netlink:AddrAdd] [type:networkService] ESC[0m(24.2)                          completed
Dec 15 14:51:22.277ESC[37m [DEBU] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [AddrUpdate.LinkAddress:{169.254.100.1 ffffff00}] [duration:11.052µs] [link.Name:ext-1] [type:networkService] ESC[0m(24.3)                          complete
Dec 15 14:51:22.277ESC[37m [DEBU] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [AddrUpdate.LinkAddress:{100:100::1 ffffffffffffffff0000000000000000}] [duration:88.521µs] [link.Name:ext-1] [type:networkService] ESC[0m(24.4)                          complete
Dec 15 14:51:22.278ESC[36m [INFO] [id:3621c89d-8eb6-4a58-b15d-5556703fb51f] [VLAN-ID:100] [type:networkService] ESC[0m(16.1)                  bridge is used instead of xconnect

registry.nordix.org/cloud-native/nsm/cmd-forwarder-vpp:with-bd-wout-mac:

Dec 15 15:23:51.092ESC[37m [DEBU] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [HostIfName:ext-1] [HostNamespace:/proc/1/fd/19] [TapFlags:TapFlags(0)] [duration:42.254814ms] [swIfIndex:3] [type:networkServi
ce] [vppapi:TapCreateV2] ESC[0m(29.2)                               completed
Dec 15 15:23:51.092ESC[37m [DEBU] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [duration:111.114µs] [mode:RX_MODE_API_ADAPTIVE] [swIfIndex:3] [type:networkService] [vppapi:SwInterfaceSetRxMode] ESC[0m(29.3)
                               completed
Dec 15 15:23:51.093ESC[37m [DEBU] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [duration:44.337µs] [link.Name:ext-1] [netlink:LinkByName] [type:networkService] ESC[0m(29.4)                               com
pleted
Dec 15 15:23:51.093ESC[37m [DEBU] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [alias:server-5de74de3-8008-4935-a417-3e58d7e30532] [duration:35.466µs] [link.Name:ext-1] [netlink:LinkSetAlias] [type:networkS
ervice] ESC[0m(29.5)                               completed
Dec 15 15:23:51.093ESC[37m [DEBU] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [duration:7.448µs] [link.Name:ext-1] [netlink:LinkSetUp] [type:networkService] ESC[0m(29.6)                               compl
eted
Dec 15 15:23:51.093ESC[37m [DEBU] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [duration:51.773µs] [swIfIndex:3] [tag:1245dadf-2476-486c-9550-3ad8520aad3a] [type:networkService] [vppapi:SwInterfaceTagAddDel
] ESC[0m(26.1)                            completed
Dec 15 15:23:51.093ESC[37m [DEBU] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [Addr:169.254.100.1/24] [duration:39.922µs] [link.Name:ext-1] [netlink:AddrAdd] [type:networkService] ESC[0m(24.1)                          completed
Dec 15 15:23:51.093ESC[37m [DEBU] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [Addr:100:100::1/64] [duration:42.47µs] [link.Name:ext-1] [netlink:AddrAdd] [type:networkService] ESC[0m(24.2)                          completed
Dec 15 15:23:51.093ESC[37m [DEBU] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [AddrUpdate.LinkAddress:{169.254.100.1 ffffff00}] [duration:8.649µs] [link.Name:ext-1] [type:networkService] ESC[0m(24.3)                          complete
Dec 15 15:23:51.094ESC[37m [DEBU] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [AddrUpdate.LinkAddress:{100:100::1 ffffffffffffffff0000000000000000}] [duration:81.947µs] [link.Name:ext-1] [type:networkService] ESC[0m(24.4)                          complete
Dec 15 15:23:51.106ESC[36m [INFO] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [bridgeID:7271263] [duration:11.895076ms] [isAdd:true] [type:networkService] [vppapi:BridgeDomainAddDelV2] ESC[0m(18.1)                    completed
Dec 15 15:23:51.106ESC[36m [INFO] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [bridgeID:7271263] [duration:120.082µs] [isAdd:true] [shg:1] [swIfIndex:3] [type:networkService] [vppapi:SwInterfaceSetL2Bridge] ESC[0m(18.2)                    completed
Dec 15 15:23:51.106ESC[36m [INFO] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [bridgeID:7271263] [duration:146.24µs] [isAdd:true] [shg:0] [swIfIndex:2] [type:networkService] [vppapi:SwInterfaceSetL2Bridge] ESC[0m(18.3)                    completed
Dec 15 15:23:51.107ESC[36m [INFO] [id:1245dadf-2476-486c-9550-3ad8520aad3a] [VLAN-ID:100] [type:networkService] ESC[0m(16.1)                  bridge is used instead of xconnect

@zolug
Copy link
Contributor Author

zolug commented Dec 16, 2022

I guess the problem is that in l2bridgedomain request during the initial request the vlan id can not be loaded from the metadata yet as no response was received at that stage from the NSE. IMHO checking for vlan would make sense after next.Server.

The following changes fixed the problem for me:

diff --git a/pkg/networkservice/xconnect/l2bridgedomain/server.go b/pkg/networkservice/xconnect/l2bridgedomain/server.go
index 8267936..0a79f2a 100644
--- a/pkg/networkservice/xconnect/l2bridgedomain/server.go
+++ b/pkg/networkservice/xconnect/l2bridgedomain/server.go
@@ -46,9 +46,7 @@ func NewServer(vppConn api.Connection) networkservice.NetworkServiceServer {
 }
 
 func (v *l2BridgeDomainServer) Request(ctx context.Context, request *networkservice.NetworkServiceRequest) (*networkservice.Connection, error) {
-       vlanID, ok := vlan.Load(ctx, true)
-       // return if the belonging remote mechanism not vlan mechanism
-       if !ok || request.GetConnection().GetPayload() != payload.Ethernet {
+       if request.GetConnection().GetPayload() != payload.Ethernet {
                return next.Server(ctx).Request(ctx, request)
        }
 
@@ -59,6 +57,12 @@ func (v *l2BridgeDomainServer) Request(ctx context.Context, request *networkserv
                return nil, err
        }
 
+       // return if the belonging remote mechanism not vlan mechanism
+       vlanID, ok := vlan.Load(ctx, true)
+       if !ok {
+               return conn, err
+       }
+
        if err := addBridgeDomain(ctx, v.vppConn, &v.b, vlanID); err != nil {
                closeCtx, cancelClose := postponeCtxFunc()
                defer cancelClose()

@johlj
Copy link

johlj commented Dec 21, 2022

@edwarnicke @denis-tingaikin

Hi,

do you think it would be possible to get this into NSM 1.7? Not sure if we will be able to use the release without this fix

Br,
Johan

@denis-tingaikin
Copy link
Member

denis-tingaikin commented Dec 21, 2022

@johlj Seems like it is fixed by #781

The fix #781 is included in v1.7.0

Could you please check this on
https://github.com/networkservicemesh/deployments-k8s/tree/release/v1.7.0?

@johlj
Copy link

johlj commented Dec 22, 2022

Hi,

ah it got in, great :)
We'll check but think it should be fine

Br,
Johan

@denis-tingaikin
Copy link
Member

@johlj Do you have any updates?

@johlj
Copy link

johlj commented Feb 14, 2023

Hi,

yes this issue is resolved

Br,
Johan

@denis-tingaikin
Copy link
Member

Nice!

Feel free to open new tickets for problems that you see 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

No branches or pull requests

3 participants