
Tracking Issue: Support for pingora-load-balancing #39

Open
jamesmunns opened this issue Jun 3, 2024 · 10 comments
Labels
  • F-Configuration: Functionality relating to configuration
  • F-ServiceDiscovery: Functionality relating to Service Discovery of Upstream Servers
  • F-Upstream: Functionality relating to the Upstream/Connector interfaces

Comments

@jamesmunns (Collaborator)

The pingora-load-balancing crate actually provides three key features related to managing upstreams:

  • Load Balancing, i.e. sharing all incoming requests between multiple upstream servers
  • Health Checks, i.e. ensuring all upstream servers are healthy or live before proxying requests to them
  • Service Discovery, i.e. updating the list of upstream servers over time

river will be integrating pingora-load-balancing in order to add these features. This post discusses what does and does not exist today.

Load Balancing

Load balancing functionality is largely contained within the selection module.

There are currently four supported algorithms:

  • Consistent (Ketama) hashing
  • FNV hashing
  • Random
  • Round Robin

All of these algorithms support weighting of different backends. TODO: I'm unsure how weighting will interact with health checks and service discovery.
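To make the weighting idea concrete, here is a minimal, self-contained sketch (not pingora-load-balancing's actual implementation): a weighted round-robin that repeats each backend in the rotation once per unit of weight.

```rust
// Illustrative weighted round-robin: a backend with weight 2 appears twice
// in the rotation, so it receives twice the traffic of a weight-1 backend.
struct WeightedRoundRobin {
    slots: Vec<String>, // one entry per unit of weight
    next: usize,
}

impl WeightedRoundRobin {
    fn new(backends: &[(&str, usize)]) -> Self {
        let mut slots = Vec::new();
        for (addr, weight) in backends {
            for _ in 0..*weight {
                slots.push(addr.to_string());
            }
        }
        WeightedRoundRobin { slots, next: 0 }
    }

    fn select(&mut self) -> &str {
        let idx = self.next % self.slots.len();
        self.next += 1;
        &self.slots[idx]
    }
}
```

With backends `("10.0.0.1:80", 2)` and `("10.0.0.2:80", 1)`, successive calls to `select()` yield the first address twice for every one pick of the second.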

It is likely we will need/want more load balancing algorithms in the future. For example, NGINX supports quite a few.

To begin with, we will want to add support for all four currently supported algorithms, selected on a per-service basis. Configuration will need to include weighting cost for statically selected server lists.

Health Checks

Health check functionality is largely contained within the health_check module.

There are currently two supported methods of health check:

  • TCP health checks, which establish health by opening a TCP (or TLS) connection
  • HTTP health checks, which establish health by sending an HTTP(S) request

Both methods have some basic shared configuration, such as hysteresis for healthy/unhealthy determination, as well as the connection information for each peer.
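The hysteresis mentioned above can be sketched as follows (assumed behavior for illustration; pingora's actual logic may differ): a backend only flips between healthy and unhealthy after N consecutive observations in the other direction.

```rust
// Illustrative health-state hysteresis: `consecutive_successes` is the "on"
// threshold, `consecutive_failures` is the "off" threshold.
struct HealthState {
    healthy: bool,
    consecutive_successes: usize,
    consecutive_failures: usize,
    streak: usize, // consecutive results disagreeing with the current state
}

impl HealthState {
    fn new(consecutive_successes: usize, consecutive_failures: usize) -> Self {
        HealthState { healthy: true, consecutive_successes, consecutive_failures, streak: 0 }
    }

    // Record one check result, returning the (possibly updated) health state.
    fn observe(&mut self, success: bool) -> bool {
        if success == self.healthy {
            self.streak = 0; // result agrees with current state: reset
        } else {
            self.streak += 1;
            let threshold = if self.healthy {
                self.consecutive_failures
            } else {
                self.consecutive_successes
            };
            if self.streak >= threshold {
                self.healthy = !self.healthy;
                self.streak = 0;
            }
        }
        self.healthy
    }
}
```

A single failed check against a healthy backend does not mark it down; only a run of failures (or successes) long enough to cross the configured threshold flips the state.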

Additionally, HTTP health checks have quite a few more options, including what to request from the server, how to validate the response, and additional checks like whether a TCP/TLS connection can be re-used (this makes checking more efficient, but can mask firewall or routing issues).

We will want to support both options; some effort is expected in determining reasonable configuration APIs for the health checks.

Service Discovery

Service discovery functionality is largely contained within the discovery module.

Currently, the only provided algorithm is "Static", which uses a fixed array of backends (i.e. "no discovery").

There does exist a ServiceDiscovery trait that allows for implementing and providing service discovery functionality.

We have a couple of required service discovery abilities in river's requirements. We may want to implement DNS-SD or polling of SRV records, though that may not occur during the initial implementation.
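As a rough sketch of the shape such a trait takes, here is a heavily simplified, hypothetical version (the real pingora-load-balancing `ServiceDiscovery` trait is async and returns richer backend/health information; names here are illustrative only):

```rust
use std::collections::BTreeSet;

// Hypothetical simplification of the service-discovery idea: a discovery
// source returns the current set of backend addresses each time it is polled.
trait Discover {
    fn discover(&self) -> BTreeSet<String>;
}

// "Static" discovery: a fixed list, i.e. "no discovery". A DNS-SD or
// SRV-record implementation would refresh this set over time instead.
struct StaticDiscovery {
    backends: BTreeSet<String>,
}

impl Discover for StaticDiscovery {
    fn discover(&self) -> BTreeSet<String> {
        self.backends.clone()
    }
}
```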

Summary

This tracking issue will likely cover multiple PRs for implementing this functionality. We'll need to design quite a bit of configuration surface for these new features, and it will likely be a breaking change from the existing "one listener" configuration file.

After implementation, we should also cross-check against the requirements that pingora-load-balancing has sufficient API surface to allow for the things we've mentioned, such as DNS TTL tracking to invalidate hosts.

@jamesmunns jamesmunns added the F-Upstream, F-ServiceDiscovery, and F-Configuration labels Jun 3, 2024
@jamesmunns jamesmunns added this to the Kickstart Spike 2 milestone Jun 3, 2024
@jamesmunns jamesmunns pinned this issue Jun 3, 2024
@jamesmunns (Collaborator Author)

Notes on configuration file contents:

  • "Connectors" (general config, superseding the current "Connector" config)
    • proxy_addr (addr+port, maybe also allow hostname+port?)
    • tls_sni (optional, hostname)
    • weight (integer? relative weighting?)
  • Selection
    • "kind"/"flavor" (currently: Consistent/FNV/Random/RoundRobin)
    • These algorithms don't currently have individual configurables, but it's likely future
      items will.
    • We should probably define a default, maybe RoundRobin?
  • Health Checks
    • "kind"/"flavor" (currently None/TCP/HTTP)
    • General Settings
      • consecutive_successes (integer, "on" hysteresis)
      • consecutive_failures (integer, "off" hysteresis)
      • peer_options (a lot, see PeerOptions)
    • Flavor: TCP
      • none? address/sni are repeated, but probably covered by connector options
    • Flavor: HTTP
      • reuse_connection (fresh TCP/TLS on every check?)
      • req (request header, path and header k:v)
      • validator (Function for validating response).
        • TODO: We will need to define potential options for this!
      • port_override (integer - if we want to use another port for health check)
      • address/sni: not needed? covered by connector options?
      • proxy (TODO: I don't know what this is used for?)
      • client_cert_key (should we specify this as part of the Connector config? is this a reasonable override?)
      • scheme (HTTP/HTTPS - is this already covered by the TLS/SNI setting?)
  • Discovery
    • Right now the only option is "static", we will want to reserve nested configuration like we will for health checks.
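Sketched as a config file, the surface outlined above might look something like this (a hypothetical TOML fragment; key names and nesting are illustrative, not final):

```toml
# Hypothetical river configuration sketch -- names are not final.
[[connectors]]
proxy_addr = "192.0.2.10:8080"
tls_sni = "backend.example.com"
weight = 5

[selection]
kind = "RoundRobin"            # Consistent / FNV / Random / RoundRobin

[health-checks]
kind = "Http"                  # None / Tcp / Http
consecutive-successes = 2      # "on" hysteresis
consecutive-failures = 3       # "off" hysteresis
reuse-connection = true
path = "/healthz"

[discovery]
kind = "Static"                # nested config reserved for future kinds
```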

@moderation

Cool to see work on load balancing. This is a key feature and I was surprised initially when I saw connector as a struct and not an array - https://github.com/memorysafety/river/blob/main/docs/toml-configuration.md?plain=1#L22-L24 👍

@jamesmunns (Collaborator Author)

I've been working on configuration a bit, as I think there will be quite a bit of configuration surface for this. You can see the changes here: #42 (comment)

@jamesmunns (Collaborator Author)

I've been looking at this the past couple of days, and I hadn't realized that the various pingora-load-balancing traits are not object safe.

I also took a bit of a look to see how straightforward this would be to work around, but the ability for trait implementors to create new instances (generally for the purpose of updating via ArcSwap items) is pretty fundamental to the current API.

I have a chat scheduled today with the pingora engineers, but I can see the outcomes being one of the following:

Work with pingora team to make traits object safe

This would be a pretty significant API change for pingora-load-balancing, but would allow us to use dyn Trait to load the implementations at runtime, based on the content of the configuration file.

Write object safe workarounds for existing p-l-b traits

This would entail wrapping the existing trait implementors with a hand-rolled version of dynamic dispatch, rather than using dyn Trait directly. This could be an enum of the different options, or something more invasive or unsafe.

This would allow us to avoid changing p-l-b, but might incur additional performance costs, as well as maintenance overhead. This is likely not a blocker, but might be relevant as this dispatching would be required on a relatively "hot" path (ProxyHttp::upstream_peer()).
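The enum-of-options workaround could look roughly like this (a self-contained illustration of the dispatch pattern, not river or p-l-b code; the selection logic here is deliberately trivial):

```rust
// Hand-rolled dispatch over known selection kinds: one `match` per call
// instead of a `dyn Trait` vtable lookup.
enum Selector {
    RoundRobin { next: usize },
    KeyHash, // stand-in for a hashing strategy such as FNV
}

impl Selector {
    fn select<'a>(&mut self, key: &[u8], backends: &'a [&'a str]) -> &'a str {
        match self {
            Selector::RoundRobin { next } => {
                let picked = backends[*next % backends.len()];
                *next += 1;
                picked
            }
            Selector::KeyHash => {
                // Toy hash: sum of key bytes, modulo backend count.
                let hash: usize = key.iter().map(|b| *b as usize).sum();
                backends[hash % backends.len()]
            }
        }
    }
}
```

The `match` adds a branch on the hot path, which is the performance cost alluded to above, though branch prediction should make it cheap in practice.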

Don't use p-l-b

We could define our own traits that are similar to p-l-b, but are object safe, or work in a way that is suitable for River's needs.

This would make it more challenging to take newer algorithms and options from pingora, as well as making it challenging to upstream any we create for River.

Summary

I'll update after speaking with the pingora dev team.

@mcpherrinm

I see you've started work, so I'm interested to hear how the Pingora discussion went and if you've got an idea which direction you're going (no pressure if you're not ready to share yet!)

@jamesmunns (Collaborator Author)

Hey @mcpherrinm! Thanks for the reminder!

The outcome was noted here: cloudflare/pingora#275 (comment)

But the short answer is "I was too zoomed in": it actually wasn't necessary to make the P-L-B items object-safe, since we already have type erasure one level up, as pingora takes (type erased) dyn Service items.

@jamesmunns (Collaborator Author)

If anyone is interested in how I handle the "trickery" here, the biggest parts are:

First: Adding the generics to the RiverProxyService, here:

impl<BS> RiverProxyService<BS>
where
    BS: BackendSelection + Send + Sync + 'static,
    BS::Iter: BackendIter,
{

Then: A function that picks the right "type" of service based on the appropriate selection algorithm, here:

/// Create a proxy service, with the type parameters chosen based on the config file
pub fn river_proxy_service(
    conf: ProxyConfig,
    server: &Server,
) -> Box<dyn pingora::services::Service> {
    // Pick the correctly monomorphized function. This makes the functions all have the
    // same signature of `fn(...) -> Box<dyn Service>`.
    type ServiceMaker =
        fn(ProxyConfig, &Server, RequestSelector) -> Box<dyn pingora::services::Service>;
    let (service_maker, selector): (ServiceMaker, RequestSelector) =
        match conf.upstream_options.selection {
            SelectionKind::RoundRobin => (
                RiverProxyService::<RoundRobin>::from_basic_conf,
                null_selector,
            ),
            SelectionKind::Random => (RiverProxyService::<Random>::from_basic_conf, null_selector),
            SelectionKind::Fnv => (RiverProxyService::<FVNHash>::from_basic_conf, null_selector),
            SelectionKind::Ketama => (
                RiverProxyService::<KetamaHashing>::from_basic_conf,
                null_selector,
            ),
        };
    service_maker(conf, server, selector)
}
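For readers who want to try the pattern in isolation, here is a stripped-down, self-contained version (all names here are illustrative, not river's actual types): the generic parameter is erased by coercing each monomorphized constructor to a common `fn() -> Box<dyn Trait>` signature.

```rust
use std::marker::PhantomData;

// A minimal object-safe service trait standing in for pingora's `Service`.
trait Service {
    fn name(&self) -> &'static str;
}

// Marker types standing in for the selection algorithms.
trait Algo {
    const NAME: &'static str;
}
struct RoundRobin;
struct Random;
impl Algo for RoundRobin { const NAME: &'static str = "round-robin"; }
impl Algo for Random { const NAME: &'static str = "random"; }

// The generic service, parameterized by selection algorithm.
struct Svc<A: Algo>(PhantomData<A>);

impl<A: Algo> Service for Svc<A> {
    fn name(&self) -> &'static str { A::NAME }
}

impl<A: Algo + 'static> Svc<A> {
    // Each monomorphization shares the signature `fn() -> Box<dyn Service>`.
    fn make() -> Box<dyn Service> {
        Box::new(Svc::<A>(PhantomData))
    }
}

// Pick the correctly monomorphized constructor based on (config) input.
fn make_service(kind: &str) -> Box<dyn Service> {
    let maker: fn() -> Box<dyn Service> = match kind {
        "random" => Svc::<Random>::make,
        _ => Svc::<RoundRobin>::make,
    };
    maker()
}
```

The key move is that the `match` selects a plain function pointer, so the generic parameter never has to appear in the caller's signature.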

@mcpherrinm

Neat, thanks!

@jamesmunns (Collaborator Author)

Short progress update here:

@jamesmunns (Collaborator Author)

I've gone ahead and merged #46, which corrected the loose ends in #45.

There are still quite a few configurables that need to be "wired up" for health checking and load balancing. I may go ahead and focus on static page hosting to be able to set up a "fleet" of river instances for easier testing.
