IP collisions when using more than 1 replica of icmp-responder #1999

d-uzlov · 2021-07-01T08:46:28Z

Overview

When we set replicas value for an nse-kernel deployment to more than 1, we create several identical endpoints (=endpoints with the same configuration).

When client makes several requests, and these requests go to different endpoints with the same configuration, these endpoints can give the client the same IP addresses, and second request will overwrite the connection of the first request.
While we could call this an invalid configuration in case of completely distinct endpoints, from user side there is nothing to be done when using replicas, so this must be a valid configuration.

The issue is present when the following preconditions are met:

There are at least 2 endpoints with the same IP prefix in config
A client makes at least 2 requests, which can go the those endpoints

Context

When making scalability tests I slightly modified icmp responder to register itself for several services and set 2 replicas, and added many connections to client config.
I immediately stumbled upon instability in tests: sometimes clients were getting all required connections, sometimes they were few connections short. This was caused, as explained above, by the fact that 1 client went to 2 servers and got the same IP for 2 distinct connections, and second connection overwrote the first.

Possible solutions

Solution 1

We have the excluded_prefixes field in request connection context, which we can use to pass the list of occupied IP addresses, so the endpoint doesn't try to overwrite them.

However, this solution assumes that we know occupied addresses beforehand.
But if we were to make requests from several threads, we wouldn't know which addresses could be taken by other threads.
I'm not sure if it would be possible to solve this without some kind of global mutex in the NSC, which would prevent parallel requests.

Solution 2

Maybe we could have some synchronization in the moment of creating a connection.
At the time of writing this I haven't researched how hard it would be to add such synchronization, and which exact components we would need to change or create.

Also I would think that maybe there is some method of non-destructive interface assignment, so that we would get an error on an attempt to use already occupied IP address instead of silently overwriting previous connection.

Solution 3

We could add some synchronization for endpoints.
Imagine we had some kind of registry, that would hold the data of already used IP prefixes.
Endpoints could query this registry on startup, so when we create several replicas of an endpoint, each instance would get its own IP prefix, so they just wouldn't have the same config.

Intermediate conclusion

Solution 1 seems simple but not universal to me.
Solution 2 seems very promising, but I'm yet to verify how hard (if possible) it would be to implement.
Solution 3 would probably require quite a lot of changes inside all of the endpoints, and we would also introduce a new concept, which would increase complexity of the system, and I'm really not sure it is justified.

The text was updated successfully, but these errors were encountered:

denis-tingaikin · 2021-07-01T10:07:01Z

I think it is a good issue to consider corner cases of using nsm.
Currently, it seems to me that this issue has low priority so put it into Backlog.

denis-tingaikin · 2021-07-01T10:07:26Z

cc @edwarnicke

d-uzlov added the bug Something isn't working label Jul 1, 2021

d-uzlov self-assigned this Jul 1, 2021

d-uzlov added this to To do in Issue/PR tracking Jul 1, 2021

denis-tingaikin moved this from To do to Backlog in Issue/PR tracking Jul 1, 2021

d-uzlov mentioned this issue Jul 13, 2021

Scalability Testing: System NSM testing #1015

Open

yuraxdrumz mentioned this issue Mar 7, 2022

connectServer Request fails with context canceled error during refresh period networkservicemesh/sdk#1068

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IP collisions when using more than 1 replica of icmp-responder #1999

IP collisions when using more than 1 replica of icmp-responder #1999

d-uzlov commented Jul 1, 2021 •

edited

Loading

denis-tingaikin commented Jul 1, 2021

denis-tingaikin commented Jul 1, 2021

IP collisions when using more than 1 replica of icmp-responder #1999

IP collisions when using more than 1 replica of icmp-responder #1999

Comments

d-uzlov commented Jul 1, 2021 • edited Loading

Overview

Context

Possible solutions

Solution 1

Solution 2

Solution 3

Intermediate conclusion

denis-tingaikin commented Jul 1, 2021

denis-tingaikin commented Jul 1, 2021

d-uzlov commented Jul 1, 2021 •

edited

Loading