
Make service discovery more reliable #80

Closed

claustres opened this issue Nov 4, 2020 · 7 comments

@claustres
Member

We have faced multiple issues related to service discovery, e.g. #49 or #1.

Currently we only publish services on initialization or when a new node is added. However, there is no guarantee that the remote app is already initialized and ready to register a service when it is published. We have added configuration options like COTE_DELAY or PUBLICATION_DELAY to mitigate this, but a more reliable approach might be possible.
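
For reference, this is roughly how those delays get configured (a minimal sketch, assuming the option names `coteDelay` and `publicationDelay` mirror the environment variables mentioned above):

```js
// Minimal sketch, assuming the option names `coteDelay` and `publicationDelay`
// mirror the COTE_DELAY and PUBLICATION_DELAY environment variables.
const feathers = require('@feathersjs/feathers')
const distribution = require('@kalisio/feathers-distributed')

const app = feathers()
app.configure(distribution({
  coteDelay: 5000,        // wait before initializing the cote layer
  publicationDelay: 5000  // wait again before publishing local services,
                          // giving remote apps time to finish their own setup
}))
```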

For instance, we could publish services on a regular basis, like a heartbeat.
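
To make the heartbeat idea a bit more concrete, here is a rough sketch of periodic re-publication (not the current implementation; the publisher name, event name, payload shape and interval are illustrative only):

```js
// Rough sketch of the heartbeat idea, not the actual implementation: the
// publisher name, event name, payload shape and interval are illustrative.
const feathers = require('@feathersjs/feathers')
const cote = require('cote')

const app = feathers()
// ... local services would be registered here, e.g. app.use('orders', ...)

const servicePublisher = new cote.Publisher({ name: 'services publisher' })
const HEARTBEAT_INTERVAL = 10000 // ms, illustrative value

// Publish a descriptor for every locally registered Feathers service
function publishServices () {
  Object.keys(app.services).forEach(path => {
    servicePublisher.publish('service', { path })
  })
}

// Publish once on initialization as today...
publishServices()
// ...then keep re-publishing so a remote app that missed the first
// publication (because it was not yet ready) eventually catches up.
setInterval(publishServices, HEARTBEAT_INTERVAL)
```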

@nathanbrizzee

Any ideas on how / when a fix for this might come about? We are noticing that when we bring up two servers (with different services in each, to get a more microservices-like architecture) and one of them goes down (force-killed for testing), the other server is not aware of it and still tries to send events to the offline server. Is there a way that, when a server goes down, its registered services could be un-registered? I like your heartbeat idea: when a server stops responding to the heartbeat, its services are un-registered. Just a thought. Thanks for the excellent work on this library!

@claustres
Member Author

Hi, I think what you are talking about is a different beast than the one covered by this issue. This issue is more about the fact that:

  1. we rely on cote for service discovery but stack an additional Feathers app layer on top of it, so we need to control the initialization order, otherwise our Feathers app might not be ready to handle cote events;
  2. each app is operated autonomously and we don't control the initialization order across apps, so one app might publish its services while others are not yet ready to handle the publication.

However, I've just added some documentation about the example. I've notably detailed the Docker Compose test setup, so that you can try a similar situation of a replicated service going down and see how it behaves. If the example works but your use case does not, please file a dedicated issue with as much information as possible so that we can reproduce and help.

@kabnfever

Hi Luc, love this package, it suits our needs very well, with the one exception that Nathan noted. You are correct that the issue Nathan brings up is different from what you propose in this issue. I do agree with your proposal to “publish services on a regular basis, like a heartbeat”, because using the delay variables becomes a guessing game as the number of servers and microservices grows.

That being said, I agree with Nathan regarding service removal or deregistration, because the concept of service discovery, and its reliability, covers both service additions and removals. The feathers-distributed package is fully instrumented for service & application “additions”, but only a FIXME placeholder exists for service “removals”, and as the associated comment states (“FIXME: we should manage apps going offline”), you were aware that this would need to be addressed eventually… well, eventually is here 😊. In any service architecture, microservices or not, servers go offline for various reasons and for varying lengths of time. The services offered by an offline server should not remain available to the gateway server, or to any other server on the feathers-distributed subnet. Keeping them around provides a false positive of available services, and unsuspecting apps (e.g. gateway apps) would call services that are actually unavailable.

The cote package emits a removal event (cote:remove) that the feathers-distributed package subscribes to, as your placeholder code indicates. If I knew enough details about what feathers-distributed does upon service & application registration and publishing, I'd offer a solution for how to deregister services and applications, but I don't; a rough sketch of what I have in mind follows below. Hoping you will take this on. Again, thank you for this great package!
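
A purely speculative sketch (the 'removed' event wiring, the `remoteServicePaths` bookkeeping and the payload shape are all guesses, not the actual feathers-distributed API):

```js
// Purely speculative sketch: the 'removed' node event, the
// `remoteServicePaths` bookkeeping and the payload shape are hypothetical,
// not the actual feathers-distributed API.
function setupUnpublishing (app, discovery) {
  discovery.on('removed', node => {
    // Look up the remote service paths recorded when this app published them
    const byAppUuid = app.remoteServicePaths || {}
    const paths = byAppUuid[node.uuid] || []
    paths.forEach(path => {
      // Feathers had no official app.unuse() before v5, so dropping the
      // registered proxy service directly is the pragmatic option here
      delete app.services[path]
    })
    delete byAppUuid[node.uuid]
  })
}
```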

@claustres
Member Author

claustres commented Jun 18, 2021

Hi, thanks for this detailed feedback. First, to avoid any confusion, let me define a "microservice" as a functional service that can be implemented by N replicas of a Feathers service "hosted" by N Feathers apps (since services are unique within a Feathers app) and possibly deployed on M servers. For instance, an e-commerce web app can have an order microservice hosted by 3 apps to support the load and deployed on 2 servers. feathers-distributed aims at abstracting these details away by exposing the order microservice as a standard Feathers service in the web app, which can be confusing, I must confess. Maybe a REST API analogy works better here: exposing a microservice is almost like exposing an order endpoint on the web app's domain.

Having said that, from the functional perspective of the web app there is no use case where the order microservice can come and go randomly, IMHO. This service MUST be available for the app to work correctly; otherwise it's a bug and should not be considered a nominal operational mode. You can make the web app resilient to such a problem by displaying a message when the microservice does not respond, but users cannot order anyway, so it will not really help. The REST API analogy makes this clearer: it does not make sense for the endpoint to simply disappear, it always exists but could be temporarily down. That's why managing the use case where the microservice goes down is not so important; by using replication we actually want to achieve exactly the opposite, i.e. make the microservice resilient to failures and ALWAYS available.

I've just completed the example section of the documentation to explain the current behavior, from the gateway point of view, in case of replica failures. If no replicas are available to serve microservice requests, then a timeout will occur. By managing the use case where all replicas go down, we could possibly send back a 404 instead of a timeout, but I'm not sure it is worth the additional complexity and it would not solve the functional issue anyway.
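
For illustration only, that 404 option could look something like the following hypothetical helper (nothing like it exists in the current implementation):

```js
// Hypothetical helper illustrating the "404 instead of timeout" option
// discussed above; nothing like it exists in the current implementation.
const { NotFound } = require('@feathersjs/errors')

function withNotFoundOnTimeout (remoteCall, ms, path) {
  // Race the remote call against a timer: if no replica answers in time,
  // surface a 404 to the client instead of letting the request hang
  // (the pending timer is not cleared here, for brevity).
  const timer = new Promise((resolve, reject) => {
    setTimeout(() => reject(new NotFound(`Service ${path} is currently unavailable`)), ms)
  })
  return Promise.race([remoteCall, timer])
}

// Usage sketch: wrap the underlying remote request for a given service call
// withNotFoundOnTimeout(requester.send({ type: 'find', path }), 20000, path)
```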

It seems to me that it is more interesting to have a separate health check for a given microservice; that's why we implemented #79, so that you are aware of any problem and can possibly automate mitigation/failover operations. But I don't really get the use case of microservices coming and going randomly from a functional point of view. If the goal is to handle microservice failures, making microservices register/unregister dynamically during the app life cycle adds a level of complexity that should instead be delegated to a lower layer ensuring availability (in our case cote, plus mitigation/failover actions). As this issue originally illustrates, it's already hard to manage service synchronization at initialization!

I hope my explanations are clear; however, I might be wrong and will let you explain what you think in more detail ;-)

@william-seaton

I think one of the issues with this approach is that if you're running Kubernetes with auto-scaling, for instance, you can easily end up in a situation where nodes are appearing and disappearing, which is why this is so critical to have. If you get a lot of load and go from 2 replicas to 10, and it later drops back down to 2, there'd be an issue, no?

@claustres
Member Author

claustres commented Oct 5, 2021

As I've already tried to explain, no matter whether the number of replicas is 2, then 10, then 2 again, from the application perspective the service has always been "on", and cote should handle this transparently for us. An issue may arise if the number of replicas drops to zero: the current implementation will still believe the service exists, but cote requests will time out when trying to contact it. A cleaner implementation should probably send back a 404 in this case instead of a timeout, but this will not make the app work properly anyway, which is why I personally don't consider it mandatory. However, any PR on the subject is welcome.

Initially, this issue was created for a race condition that might occur at startup, when the main app and the remote service are created simultaneously and might miss each other's discovery events, but I am afraid it is currently drifting into an issue about service replicas going down/up during the app life cycle. I would be happy if someone could open a new, specific issue with a use case indicating why it is important to track the replicas' life cycle, for availability or any other relevant reason. Thanks for all your feedback, but IMHO we need a clearer view of what is actually expected. Maybe what is confusing is that the idea of publishing services on a regular basis, like a heartbeat, could tackle both issues, but from a lower-level (Feathers service) or higher-level (microservice) point of view.

@claustres
Member Author

claustres commented Mar 28, 2024

@nathanbrizzee @kabnfever @william-seaton In case you are still interested, we have implemented something like what you were asking for in #129; it is still under test if you'd like to provide us with any feedback.
