Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IP collisions when using more than 1 replica of icmp-responder #1999

Open
d-uzlov opened this issue Jul 1, 2021 · 2 comments
Open

IP collisions when using more than 1 replica of icmp-responder #1999

d-uzlov opened this issue Jul 1, 2021 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@d-uzlov
Copy link
Contributor

d-uzlov commented Jul 1, 2021

Overview

When we set replicas value for an nse-kernel deployment to more than 1, we create several identical endpoints (=endpoints with the same configuration).

When client makes several requests, and these requests go to different endpoints with the same configuration, these endpoints can give the client the same IP addresses, and second request will overwrite the connection of the first request.
While we could call this an invalid configuration in case of completely distinct endpoints, from user side there is nothing to be done when using replicas, so this must be a valid configuration.

The issue is present when the following preconditions are met:

  1. There are at least 2 endpoints with the same IP prefix in config
  2. A client makes at least 2 requests, which can go the those endpoints

Context

When making scalability tests I slightly modified icmp responder to register itself for several services and set 2 replicas, and added many connections to client config.
I immediately stumbled upon instability in tests: sometimes clients were getting all required connections, sometimes they were few connections short. This was caused, as explained above, by the fact that 1 client went to 2 servers and got the same IP for 2 distinct connections, and second connection overwrote the first.

Possible solutions

Solution 1

We have the excluded_prefixes field in request connection context, which we can use to pass the list of occupied IP addresses, so the endpoint doesn't try to overwrite them.

However, this solution assumes that we know occupied addresses beforehand.
But if we were to make requests from several threads, we wouldn't know which addresses could be taken by other threads.
I'm not sure if it would be possible to solve this without some kind of global mutex in the NSC, which would prevent parallel requests.

Solution 2

Maybe we could have some synchronization in the moment of creating a connection.
At the time of writing this I haven't researched how hard it would be to add such synchronization, and which exact components we would need to change or create.

Also I would think that maybe there is some method of non-destructive interface assignment, so that we would get an error on an attempt to use already occupied IP address instead of silently overwriting previous connection.

Solution 3

We could add some synchronization for endpoints.
Imagine we had some kind of registry, that would hold the data of already used IP prefixes.
Endpoints could query this registry on startup, so when we create several replicas of an endpoint, each instance would get its own IP prefix, so they just wouldn't have the same config.

Intermediate conclusion

Solution 1 seems simple but not universal to me.
Solution 2 seems very promising, but I'm yet to verify how hard (if possible) it would be to implement.
Solution 3 would probably require quite a lot of changes inside all of the endpoints, and we would also introduce a new concept, which would increase complexity of the system, and I'm really not sure it is justified.

@d-uzlov d-uzlov added the bug Something isn't working label Jul 1, 2021
@d-uzlov d-uzlov self-assigned this Jul 1, 2021
@d-uzlov d-uzlov added this to To do in Issue/PR tracking Jul 1, 2021
@denis-tingaikin denis-tingaikin moved this from To do to Backlog in Issue/PR tracking Jul 1, 2021
@denis-tingaikin
Copy link
Member

I think it is a good issue to consider corner cases of using nsm.
Currently, it seems to me that this issue has low priority so put it into Backlog.

@denis-tingaikin
Copy link
Member

cc @edwarnicke

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Issue/PR tracking
  
Backlog
Development

No branches or pull requests

2 participants