Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Policy Base Routing failure; Possible table ID collision. #9119

Closed
ljkiraly opened this issue May 2, 2023 · 2 comments
Closed

Policy Base Routing failure; Possible table ID collision. #9119

ljkiraly opened this issue May 2, 2023 · 2 comments
Assignees

Comments

@ljkiraly
Copy link
Contributor

ljkiraly commented May 2, 2023

Expected Behavior

On target node the routing table ID should be different for different source IPs.

# ip rule
0:      from all lookup local
32750:  from 214.14.131.114 lookup 3
32751:  from 214.14.131.113 lookup 1
32752:  from 214.14.132.66 lookup 2
32753:  from 214.14.132.65 lookup 4
32766:  from all lookup main
32767:  from all lookup default

# ip route show table 3
default via 172.16.1.1 dev nsm-1 onlink 
# ip route show table 2
default via 172.16.16.1 dev nsm-0 onlink 
# ip route show table 1
default via 172.16.1.1 dev nsm-1 onlink 
# ip route show table 4
default via 172.16.16.1 dev nsm-0 onlink 
# ip -4 a
...
15: nsm-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN group default qlen 1000
    inet 172.16.16.4/24 brd 172.16.16.255 scope global nsm-0
       valid_lft forever preferred_lft forever
    inet 214.14.132.65/32 scope global nsm-0
       valid_lft forever preferred_lft forever
    inet 214.14.132.66/32 scope global nsm-0
       valid_lft forever preferred_lft forever
16: nsm-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN group default qlen 1000
    inet 172.16.1.4/24 brd 172.16.1.255 scope global nsm-1
       valid_lft forever preferred_lft forever
    inet 214.14.131.113/32 scope global nsm-1
       valid_lft forever preferred_lft forever
    inet 214.14.131.114/32 scope global nsm-1
       valid_lft forever preferred_lft forever

Current Behavior

For some reason can happen that some route is missing and the rule point to wrong table ID:

# ip rule
0:      from all lookup local
32750:  from 214.14.131.114 lookup 3
32751:  from 214.14.131.113 lookup 1
32752:  from 214.14.132.66 lookup 2
32753:  from 214.14.132.65 lookup 1
32766:  from all lookup main
32767:  from all lookup default

# ip route show table 3
default via 172.16.1.1 dev nsm-1 onlink 
# ip route show table 2
default via 172.16.16.1 dev nsm-0 onlink 
# ip route show table 1
default via 172.16.1.1 dev nsm-1 onlink 

# ip -4 a
...
15: nsm-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN group default qlen 1000
    inet 172.16.16.4/24 brd 172.16.16.255 scope global nsm-0
       valid_lft forever preferred_lft forever
    inet 214.14.132.65/32 scope global nsm-0
       valid_lft forever preferred_lft forever
    inet 214.14.132.66/32 scope global nsm-0
       valid_lft forever preferred_lft forever
16: nsm-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN group default qlen 1000
    inet 172.16.1.4/24 brd 172.16.1.255 scope global nsm-1
       valid_lft forever preferred_lft forever
    inet 214.14.131.113/32 scope global nsm-1
       valid_lft forever preferred_lft forever
    inet 214.14.131.114/32 scope global nsm-1
       valid_lft forever preferred_lft forever

Failure Information (for bugs)

As I see the forwarder stores the routing table IDs into sync.Map keyed by connection ID. In theory two concurrent connection requests coming from the same NSC but towards different NSs can get the same table ID since there are no semaphores used in getFreeTableID call. https://github.com/networkservicemesh/sdk-kernel/blob/0ca43b02fb6f2b9467d547f79825f12075bda5a4/pkg/kernel/networkservice/connectioncontextkernel/ipcontext/iprule/common.go#L79

Steps to Reproduce

The logs were on info level when this happened. The reproduction steps not yet known.
I will update the issue if I am able to reproduce it.

  • NSM Version: v1.6.1
@NikitaSkrynnik
Copy link
Collaborator

Hi @ljkiraly. Yes, there might be a problem if one NSC sends two parallel requests. There are two options how to fix this:

  1. Add mutexes on NSC to make it do requests sequentially
  2. Two requests can go in parallel, but the first path segment should be the same in both of them. The reason for this is that begin chain element synchronizes requests based on their connection ID

ljkiraly added a commit to Nordix/nsm-sdk-kernel that referenced this issue May 4, 2023
Related issue: networkservicemesh/deployments-k8s/issues/9119

Added mutex lock to protect table ID selection from parallel runs.

Signed-off-by: Laszlo Kiraly <laszlo.kiraly@est.tech>
@ljkiraly
Copy link
Contributor Author

ljkiraly commented May 4, 2023

Hi @NikitaSkrynnik, Thanks for suggestions. I misleading you I think. I was wrong in statement that the requests causes the problem. Most probably the NSE responses are processed in parallel. I proposed to fix this with a preventive lock before selecting and assigning table IDs.

ljkiraly added a commit to Nordix/nsm-sdk-kernel that referenced this issue May 25, 2023
Related issue: networkservicemesh/deployments-k8s/issues/9119

Added mutex lock to protect table ID selection from parallel runs.

Signed-off-by: Laszlo Kiraly <laszlo.kiraly@est.tech>
ljkiraly added a commit to Nordix/nsm-sdk-kernel that referenced this issue May 25, 2023
Related issue: networkservicemesh/deployments-k8s/issues/9119

Added a mechanism to protect table ID selection from parallel runs.

Signed-off-by: Laszlo Kiraly <laszlo.kiraly@est.tech>
ljkiraly added a commit to Nordix/nsm-sdk-kernel that referenced this issue Jun 14, 2023
Related issue: networkservicemesh/deployments-k8s/issues/9119

Added a mechanism to protect table ID selection from parallel runs.

Signed-off-by: Laszlo Kiraly <laszlo.kiraly@est.tech>
ljkiraly added a commit to Nordix/nsm-sdk-kernel that referenced this issue Jun 14, 2023
Related issue: networkservicemesh/deployments-k8s/issues/9119

Added a mechanism to protect table ID selection from parallel runs.

Signed-off-by: Laszlo Kiraly <laszlo.kiraly@est.tech>
@ljkiraly ljkiraly closed this as completed Jul 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Status: Done
Development

No branches or pull requests

2 participants