[proposal] Scaling Orc8r subscribers codepath #5382
---
A few years ago we experimented w/ change data capture at the DB layer to compute materialized views of mconfigs and some bulk read endpoints. It turns out Kafka is a nightmare to maintain for a tiny eng team, so I killed it and built configurator instead, because transactions are pretty cool. But I always had it in the back of my head that some change-streaming pattern is necessary to get subscriber sync to scale. So if you want to continue caching all subs at the edge, maybe you can look into properly streaming deltas to AGWs. This was one of my bucket-list "one day" things. But maybe stay away from proper message buses and just use the DB for now, until that falls over. I had this idea of a hook at the configurator storage layer that writes out a streaming log of changes to specific entity types to some table. Then your AGWs populate their locally materialized (persistent) view of subscribers by reading that log from the last offset they consumed. Instead of constantly syncing a whole bunch of subs over the network, now you can just apply small deltas when things change. Of course you still need a paginated resync API and some work around detecting state drift. Since the part of the subscriber that AGWs care about is fairly static, this should be a huge win in reducing management-plane network traffic. Some light related reading: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/, https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
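If it helps to picture the shape of that idea, here's roughly what the hook plus consumer query could look like. Everything below (table name, columns, function names) is hypothetical, not actual configurator code:

```go
package changelog

import (
	"context"
	"database/sql"
)

// RecordChange is the storage-layer hook: it runs inside the same
// transaction that mutates the entity, so the log can never drift
// from the source-of-truth table.
func RecordChange(ctx context.Context, tx *sql.Tx, networkID, imsi string, payload []byte) error {
	_, err := tx.ExecContext(ctx,
		`INSERT INTO subscriber_change_log (network_id, imsi, payload)
		 VALUES ($1, $2, $3)`, // offset_id is a BIGSERIAL primary key
		networkID, imsi, payload)
	return err
}

// ReadDeltas is what an AGW-facing endpoint would call: return every
// change after the AGW's last consumed offset, in commit order.
func ReadDeltas(ctx context.Context, db *sql.DB, networkID string, lastOffset int64, limit int) (*sql.Rows, error) {
	return db.QueryContext(ctx,
		`SELECT offset_id, imsi, payload FROM subscriber_change_log
		 WHERE network_id = $1 AND offset_id > $2
		 ORDER BY offset_id
		 LIMIT $3`,
		networkID, lastOffset, limit)
}
```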
---
Hi guys, I tried a somewhat different approach for the local subscribers table with some success, and I'd like to discuss the ramifications of this idea. In my tests I created 300 networks of 50k subscribers each, giving us 15M total subscribers. In the AGW I created a replica of the full cfg_entities table, synchronized in real time using PostgreSQL's native LISTEN / NOTIFY mechanism and a native libpq client that replaces subscriberdb and connects directly to the database, feeding a local SQLite3 database with only the delta after the initial load. This scenario was tested with an Orc8r whose PostgreSQL database ran in a VM with 4GB RAM and 2 cores. The AGW is a bare-metal box with a Celeron J3160, 4GB of RAM, and a 128GB SSD. The client in the AGW consumes less than 10MB of RAM and can serve 16k queries per second locally. Each simultaneous connection to the server consumes about 8MB RAM, so we can extrapolate to only about 3.2GB RAM on the server side to serve 400 AGWs. Does this approach make some sense? Or is the gRPC approach a must?
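For readers who want to picture the mechanism: a rough Go equivalent of such a client, using github.com/lib/pq, might look like the sketch below. The DSN, channel name, trigger, and applyDelta helper are placeholders for whatever the actual experiment used:

```go
package main

import (
	"log"
	"time"

	"github.com/lib/pq"
)

func main() {
	connStr := "postgres://orc8r:secret@orc8r-db/magma?sslmode=require" // hypothetical DSN
	listener := pq.NewListener(connStr, 10*time.Second, time.Minute,
		func(ev pq.ListenerEventType, err error) {
			if err != nil {
				log.Printf("listener event error: %v", err)
			}
		})
	if err := listener.Listen("cfg_entities_changed"); err != nil { // channel raised by a DB trigger
		log.Fatal(err)
	}
	for n := range listener.Notify {
		if n == nil {
			continue // nil is delivered after a reconnect; a real client would resync here
		}
		applyDelta(n.Extra) // n.Extra carries the trigger's payload
	}
}

// applyDelta would upsert the changed row into the local SQLite3 database.
func applyDelta(payload string) { log.Printf("apply: %s", payload) }
```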
---
Scaling Orc8r subscribers codepath
With general patterns for economical set-interfaces.
> Original FB Quip doc
Authors
Requested reviewers
Overview
Summary
Orchestrator functionality currently drops dead at around 18k subscribers, hindering our 1.5 release plans. Further, under the current architecture, our GA scalability targets would result in at least 850 terabytes sent over the network every month, necessitating a reexamination of our subscriber architecture. This document focuses on resolving scale issues for the subscriberdb codepath. We go on to describe a selection of tunable, increasingly effective mechanisms for improving the scalability of Magma’s set-interfaces.
We describe 4 objectives in this document
See the end of this document for resulting followup tasks.
tl;dr
Open questions
Context
Overview
Magma has chosen to propagate the “set-interface” pattern, where we set the full state on each update (snapshot-based), rather than sending a change list (diff-based). This pattern is desirable for stability and ease of implementation, but it comes with the expected side effect of scalability issues on certain high-volume codepaths.
As a starting point, @karthiksubraveti is leading the effort to stand up a repeatable scale-testing process, where we can get automated scalability metrics as part of a recurring CircleCI job. From there, this document describes additional patterns the dev team can use in the push to improve the scalability statistics.
The highest-volume codepaths are those directly or indirectly tied to the set of subscribers. Currently Orc8r supports ~18k subscribers, with drop-dead cutoffs. As a starting point, we need to remove the drop-dead cutoffs in favor of implementations that degrade gradually.
After solving the drop-dead cutoffs, Orc8r also needs to support true set-interfaces for codepaths that include on the order of 20k objects of non-trivial size. Based on some rough estimates, we can use 250 bytes as the baseline size for an average subscriber object. This results in needing to handle ~5mb of serialized objects on every get-all request. For reference, gRPC’s default incoming message size limit is 4mb.
To put it another way, consider a 10-network Orc8r deployment, where each network supports our GA target of 400 gateways per network and 20k subscribers per network. Under our current set-interface patterns, that Orc8r will be sending 10 * 400 * 20k * 250b = 20gb of subscriber objects over the network every minute. Multiply that by 43800 minutes in a month, and that’s over 850 terabytes per month, just on subscriber objects.
Clearly this number is untenable. This document describes patterns for moving it to a manageable size while retaining the benefits of the set-interface. To start, consider each contributor to this final number:
From this listing, it’s clear the only non-trivial resolution to our GA scaling needs under a set-interface pattern must come with intelligently sending a trivial subset of subscribers on each request, rather than sending the full subscriber set.
This document focuses on the subscribers codepath for concreteness, but the described patterns should generalize to the subscriber-adjacent codepaths such as metrics ingestion and exporting.
Existing architecture
For an overview of Orc8r’s architecture, along with per-service functionality, see the Architecture Overview documentation.
Subscriber objects
For context on the subscriber codepath: when we say “subscriber objects”, we refer to the `SubscriberData` protobuf. This proto contains subscriber-specific information, and each of these in a network is sent down to every gateway in the network, in bulk, as the starting point for how the gateway should service a subscriber.

For context on the calculations in the previous section: a scale-testing issue raised by a partner indicated they hit gRPC max message size issues (4194304 bytes, i.e. 4mb) at 18k subscribers => 4194304 / 18k = 233 bytes in basic scale testing => 250 bytes as baseline for average production subscriber object size.
Subscriber assumptions
The set of subscribers in a network is expected to follow a strongly read-dominated pattern. That is, updates to the set of subscribers are expected to be infrequent. To contextualize subsequent design recommendations, we’ll examine the following variously-representative operator behaviors. For simplicity we only consider mutations, where a single mutation refers to a single subscriber object that needs to be re-sent down to gateways:
Consistency
Orc8r makes certain consistency guarantees which need to be upheld. Of specific concern are the consistency guarantees for configs along the northbound and southbound interfaces. The config codepath flows unidirectionally, from northbound (REST API), through the configurator service, and eventually out southbound (gateway gRPC). That is, northbound is many-readers-many-writers, while southbound is many-readers-zero-writers. To align on terminology, we’ll use terms from this Werner Vogels article on eventual consistency.
Existing guarantees
Considered alternatives
Edge-triggered subscriber push. This document assumes we want to retain the set-interface pattern throughout Orc8r — for context, see level-triggered (state) vs. edge-triggered (events). A principal alternative to combat our scalability issues is to back away from this high bar and instead go with a hybrid approach, where e.g. we send state-based snapshots at infrequent intervals to correct any issues with the more-frequent event-based updates. However, we believe we can achieve reasonable scalability metrics without resorting to edge-triggered designs.
☠️ Objective 1: resolve drop-dead scale cutoffs
Overview
For both northbound and southbound interfaces (NMS and gateways, respectively), a get-all request reads all subscriber objects in a network, optionally manipulates them, and passes the result to the caller. This pattern drops dead in two immediate areas:

- SQL errors of the form "Prepared statement contains too many placeholders", when the number of loaded entities overflows the DB driver's placeholder limit
- gRPC's default 4mb max message size, which get-all responses overflow at around 18k subscribers
[immediate fix] Configurable max gRPC message size
As a starting point, we can’t catch every scaling issue during development. We need to provide operators with recourse for emergent scaling issues, whether on the subscriber codepath or elsewhere. The simplest solution is to provide every Orc8r service a config value to set the max gRPC message size.
Implementation options: we can also consider having a single, cross-service config file, to avoid having to replicate (and remember to replicate for new services) the config value across service config files.
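A minimal sketch of the knob itself; the config plumbing (key name, loading) is an assumption, while the gRPC options are the standard ones:

```go
package service

import "google.golang.org/grpc"

// newGRPCServer builds a service's gRPC server with a deployment-tunable
// message size cap. The hypothetical maxMsgSizeBytes would come from the
// service's (or a shared) config file; gRPC's built-in default is 4mb.
func newGRPCServer(maxMsgSizeBytes int) *grpc.Server {
	return grpc.NewServer(
		grpc.MaxRecvMsgSize(maxMsgSizeBytes),
		grpc.MaxSendMsgSize(maxMsgSizeBytes),
	)
}
```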
[medium-term solution] Pagination
Google provides a great description on how to paginate listable gRPC endpoints in their API Design Guide. We’ll reproduce their recommendations with Magma-specific considerations. This pattern is also flexible across gRPC and REST endpoints. Key tenets include
For concreteness, we’ll examine the `ListSubscribers` gRPC endpoint, but the concepts generalize to the rest of the subscriber get-all codepath.

The configurator service stores entities with primary key `(network_id, type, key)`. All requests are under a particular network ID, all subscribers have the same type, and subscriber IMSI is the key. This means a list of subscribers can be ordered on their primary key by their IMSI. This allows us to use the fantastically scalable seek method for paginating DB requests.
For the configurator-specific implementation, we can use `EntityLoadCriteria` to pass in the page token and page size, and `EntityLoadResult` to return the next page token. Specifically for the subscriber codepaths, we need to:
Change to consistency guarantees. Under a paginated subscriber polling model, the full set of subscribers can be updated concurrently with an iterative request for the full set of subscriber pages. This slightly alters our consistency guarantees, but we should still have sufficiently-strong guarantees under the paging pattern. We note the difference between consistency on the “full set of subscribers” vs. consistency for “any particular subscriber.”
- Monotonic reads now apply to individual subscribers rather than the full set: you will never read an older version of an individual subscriber than one you have previously read
- Serializability: a gateway can read a set of subscribers that never existed as the outcome of a previous write. However, this is admissible because (a) individual subscriber configs are updated atomically and (b) the full set of subscribers is still eventually consistent
- Strong consistency now applies to individual subscriber objects rather than the full set. As above, we lose the serializability guarantee for the full set of subscriber objects. However, we can still guarantee strong consistency on a per-object basis, meaning read-your-writes, monotonic read, and monotonic write consistency all apply on a per-subscriber basis

For both northbound and southbound interfaces, this change from “full set of subscribers” guarantees to “any particular subscriber” guarantees should not cause correctness issues, as a coherent view of the network-wide set of subscribers is not required for correct behavior — just coherence on a per-subscriber basis. That is, we only update subscribers on a per-subscriber basis, and never in bulk — i.e., we update the single-subscriber object rather than the full-set-of-subscribers object. So reading an incoherent view of the full set of subscribers won’t cause any additional correctness issues, because we don’t provide a way to write that full set of subscribers atomically.
[considered alternative] Server-side streaming
gRPC also supports streaming constructs, where server-side streaming is an attractive option for this use-case. However, we view server-side streaming as inferior to pagination due to the following issues:
Since the subscriber get-all codepath is easier to implement via pagination at the DB side, and needs to support pagination on the northbound interface, we can avoid unnecessary complexity by making the entire codepath paginated rather than a mix of pagination and streaming.
Upshot
💾 Objective 2: reduce DB pressure at scale
Overview
Scale tests have noted DB timeout issues for subscriber get-all requests, specifically for the configurator service. This is likely due to lock contention within the DB, as a get-all request is made 1 time per minute for each gateway in the network and 2 times per minute for each NMS instance open in a browser. For our GA scale targets of 400 gateways, this results in 4,032,000 DB get-all requests across a 7-day period. Caching is a straightforward solution here.
Side note: pagination through to the REST API should reduce the NMS-induced DB pressure, but this is a minority contributor in the expected case.
[immediate fix] Increase default subscriber streamer sync interval
Current default is 1 minute. Moving this to 3 or 5 minutes would give us immediate breathing room for dealing with DB pressure.
[immediate fix] Shorter DB timeouts
In the scale-testing setup, experiment with adding more aggressive timeouts to all application DB requests — this timeout should be shorter than the interval at which the subscriber stream is updated. This may help prevent the DB from getting bogged down under excessive load, instead clearing outdated requests to ease the path back to non-degraded functionality.
Another consideration is relaxing transaction isolation levels. Unfortunately, configurator already has relaxed isolation levels — it doesn’t use serializable isolation except on table creation, and it uses Postgres-default Read Committed isolation for entity reads.
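As a sketch of what the timeout experiment could look like at the application layer (the 30-second figure is illustrative only):

```go
package storage

import (
	"context"
	"database/sql"
	"time"
)

// getAllSubscriberKeys bounds the get-all read so it fails fast instead of
// queueing behind other work; the bound should sit well under the
// subscriber stream's refresh interval.
func getAllSubscriberKeys(db *sql.DB, networkID string) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx,
		`SELECT key FROM cfg_entities WHERE network_id = $1 AND type = 'subscriber'`,
		networkID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var keys []string
	for rows.Next() {
		var k string
		if err := rows.Scan(&k); err != nil {
			return nil, err
		}
		keys = append(keys, k)
	}
	return keys, rows.Err()
}
```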
[medium-term solution] Singleton get-all caching service
Add a single-pod `subscriber_cache` service which makes the get-all DB request at a configurable interval, e.g. every 30 seconds, generates the subscriber objects, and exposes them over a paginated gRPC endpoint. Subscriberdb can then forward subscriber get-all requests to `subscriber_cache` rather than calling to configurator and manually constructing the objects.

We can get away with caching like this on the southbound interface because gateway subscriber information is only eventually consistent, and our implicit SLA is currently on the order of 60 seconds.
Note: for simplicity we describe this as a separate, singleton service. But it doesn’t actually need to be a separate service or singleton, and instead we could inject this functionality directly into the subscriberdb service, with some DB-based synchronization for determining the most-recent digest, to avoid churn caused by gateways alternating connections between subscriberdb pods with different cached views of the subscriber objects. However, a benefit of the singleton approach is it provides an easy way to get a globally-consistent view of the full set of subscribers, side-stepping inconsistent digest issues discussed in subsequent sections.
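A minimal sketch of the cache loop, with a placeholder Subscriber type and an injected loadAll function standing in for the real configurator read:

```go
package subscriber_cache

import (
	"sort"
	"sync"
	"time"
)

// Subscriber stands in for the real SubscriberData proto.
type Subscriber struct {
	IMSI   string
	Config []byte
}

type cache struct {
	mu   sync.RWMutex
	subs []Subscriber // kept sorted by IMSI so page boundaries stay stable
}

// run performs one get-all read per interval, total, no matter how many
// gateways are polling; every paginated read is served from memory.
func (c *cache) run(refresh time.Duration, loadAll func() ([]Subscriber, error)) {
	for range time.Tick(refresh) {
		subs, err := loadAll()
		if err != nil {
			continue // keep serving the previous snapshot on transient failure
		}
		c.mu.Lock()
		c.subs = subs
		c.mu.Unlock()
	}
}

// page returns up to size subscribers strictly after the IMSI page token,
// mirroring the seek semantics of the paginated endpoint.
func (c *cache) page(afterIMSI string, size int) []Subscriber {
	c.mu.RLock()
	defer c.mu.RUnlock()
	i := sort.Search(len(c.subs), func(j int) bool { return c.subs[j].IMSI > afterIMSI })
	end := i + size
	if end > len(c.subs) {
		end = len(c.subs)
	}
	return append([]Subscriber(nil), c.subs[i:end]...) // copy out of the locked slice
}
```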
[considered alternative] Redis cache
This solution is not worth the complexity it would entail at this point, where we’re just trying to get to 50-100k subscribers per network with eventual consistency in replicating down the southbound interface. However, a Redis cache in front of the configurator data is a reasonable long-term option if the single-pod `subscriber_cache` service becomes a bottleneck due to increased scaling requirements or a need for more granular, write-aware caching.

Scalability calculations
To contextualize the proposed changes in this section, consider the representative operator behaviors presented above, with a single-tenant Orc8r deployment with GA targets of 20k subscribers and 400 gateways over a 7-day period. We’ll assume the southbound cache refresh occurs every 5 minutes, which means
Upshot
☔️ Objective 3: reduce network pressure at scale
Overview
We want to retain the set-interface aspect of the southbound subscriber codepath, and the easiest way to handle this is already used within Orc8r — protobuf digests. Go’s third-party proto package supports deterministic protobuf serialization. We can leverage this to generate digests of the subscriber objects, allowing a short-circuit of the full network transmission in the common case.
An additional, synergistic option is modeling gateways as read-through subscriber caches. Then, instead of the Orc8r sending the full set of subscribers to every gateway, each gateway instead receives only a requested subset of subscribers, updated on-demand. This would alter some existing Magma affordances, specifically around “it just works” headless gateway operation, but the scalability wins may make for a worthwhile trade-off.
[medium-term solution] Flat digest of all subscribers
Similar to the way mconfig requests take an `extraArgs *any.Any` argument from the client (which is then coerced to a digest), endpoints on the subscriber get-all codepath can take a `previous_digest string` parameter. If the client’s digest matches the server’s digest, a `no_updates` boolean can short-circuit the request.

For performance purposes, the `subscriber_cache` service can compute and expose the digest each time it updates its cache.
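A sketch of the digest computation itself, using the deterministic marshaling mentioned above (sha256 and the concatenation approach are illustrative choices):

```go
package digest

import (
	"crypto/sha256"
	"encoding/hex"

	"google.golang.org/protobuf/proto"
)

// computeFlatDigest hashes the whole subscriber snapshot. Callers must pass
// subs in a stable (IMSI-sorted) order, since the digest is order-sensitive.
func computeFlatDigest(subs []proto.Message) (string, error) {
	h := sha256.New()
	opts := proto.MarshalOptions{Deterministic: true} // stable field/map ordering
	for _, s := range subs {
		b, err := opts.Marshal(s)
		if err != nil {
			return "", err
		}
		h.Write(b)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}
```

The server would then compare the client's `previous_digest` against this value and set `no_updates` on a match.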
Some followup notes:

- The `subscriber_cache` service resolves this issue in practice, providing, if not a serializable view, at least a globally-consistent one
- The `subscriber_cache` service can compute the digest multiple times per cache refresh, storing the set of digests rather than just a single digest. Clients still send a single digest, and servers check that digest against the list. This is hacky but potentially feasible

[long-term solution] Tree-based digest: Merkle radix tree
If the assumption of minimal subscriber updates is violated, or we scale past viability of the flat digest pattern, we can gracefully transition to a tree-based digest pattern. This will allow arbitrarily-high resolution understandings of which subscriber objects need to be sent over the network. And since this pattern is resilient to lost updates, we retain the benefits of the set-interface pattern.
Context: Merkle tree. Merkle trees allow tree-based understanding of which data block (in our case, subscriber object) has mutated. See this Merkle tree visualization to get an intuitive understanding. With two Merkle trees, you can start with the root nodes, compare, then recurse to each child until you find one of the following
Context: Radix tree. Prefix trees support stable, deterministic arrangement of objects based on string keys. In their compressed form they’re called radix trees, which is what we’ll want to use. See this radix tree visualization to get an intuitive understanding.
Solution: Merkle radix trees. Combining Merkle and radix trees allows us to organize the full set of subscribers into a stable, IMSI-keyed tree (radix tree), then recursively generate digests for each node up the tree (Merkle tree). This unlocks arbitrarily fine-grained control over how many subscriber objects must be sent, even under high subscriber write volume. Note that, per our Swagger spec, an IMSI can contain between 10 and 15 digits, inclusive. This results in a maximum tree depth of 15 and maximum branching factor of 10, and due to the nature of radix trees the full tree size will only be a function of the total number of subscribers.
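To make the structure concrete, here is a simplified sketch that Merkle-izes a fixed-radix (uncompressed) IMSI digit trie; a production version would use path compression and incremental rehashing:

```go
package merkle

import "crypto/sha256"

// node is one trie node: interior nodes branch on the next IMSI digit,
// and a node at full-IMSI depth holds the serialized subscriber object.
type node struct {
	hash     [32]byte
	children map[byte]*node
	leaf     []byte
}

// insert walks the IMSI digit-by-digit, creating nodes as needed.
func insert(root *node, imsi string, payload []byte) {
	cur := root
	for i := 0; i < len(imsi); i++ {
		if cur.children == nil {
			cur.children = map[byte]*node{}
		}
		next, ok := cur.children[imsi[i]]
		if !ok {
			next = &node{}
			cur.children[imsi[i]] = next
		}
		cur = next
	}
	cur.leaf = payload // call root.rehash() after a batch of mutations
}

// rehash recomputes Merkle hashes bottom-up; iterating digits in fixed
// order keeps the digest deterministic. Comparing two trees root-down
// then isolates exactly the subtrees (IMSI prefixes) that differ.
func (n *node) rehash() [32]byte {
	if n.leaf != nil {
		n.hash = sha256.Sum256(n.leaf)
		return n.hash
	}
	h := sha256.New()
	for d := byte('0'); d <= '9'; d++ {
		if c, ok := n.children[d]; ok {
			h.Write([]byte{d})
			sum := c.rehash()
			h.Write(sum[:])
		}
	}
	copy(n.hash[:], h.Sum(nil))
	return n.hash
}
```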
Some followup notes
[long-term solution] Configurable gateway subscriber views
Current architecture has a single, global view of the full set of subscribers, and each gateway receives a full copy of that global view. However, we don’t necessarily need to have a single view — instead, we can have the Orc8r view (global, all subscribers) and a new, per-gateway view (non-strict subset). Supporting gateway-wise views unlocks sending only a subset of subscribers to a particular gateway, independent of whether we include the digest pattern, while retaining set-interface benefits.
From an implementation perspective, we can support a per-network, REST-API-configurable “gateway view mode”, with the following 3 modes, where we can default to the read-through cache mode.
Identical, global view. This is the current pattern. Every gateway receives all subscribers, in an eventually-consistent manner. Supports headless gateways and mobility across headless gateways.
Static mapping. Operators manually map subscribers to gateways. Useful for restricting subscribers to a particular CPE, for example.
Gateway as read-through cache. Gateways only pull subscribers on-demand. Supports headless gateways but not mobility across headless gateways.
[considered alternative] Configurator entity-type versioning
One medium-term alternative to digest-based caching is to outfit configurator with an atomic counter, per entity type, which is incremented on every mutation to an entity of that type. This can be exposed to the read endpoints, allowing readers to determine with low resolution whether there has been a change to any entity of the chosen type. Instead of storing and sending digests, gateways can include relevant version numbers as a drop-in replacement — that is, since the digest has no semantic meaning to clients, we can literally place the version number into the digest fields mentioned previously.
One consideration for this solution is that subscriberdb pulls multiple entity types when it constructs the subscriber protos. The version number actually exposed to clients would need to be a function of the full set of version numbers consumed from configurator by subscriberdb.
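As a sketch of the counter mechanics under this alternative (the table and function names are hypothetical):

```go
package versioning

import (
	"context"
	"database/sql"
	"fmt"
)

// bumpTypeVersion runs inside the same transaction as the entity mutation,
// so readers never observe a stale counter alongside fresh data.
func bumpTypeVersion(ctx context.Context, tx *sql.Tx, networkID, entityType string) error {
	_, err := tx.ExecContext(ctx,
		`UPDATE cfg_entity_type_versions SET version = version + 1
		 WHERE network_id = $1 AND type = $2`,
		networkID, entityType)
	return err
}

// clientVersion folds the versions of every entity type subscriberdb reads
// into the single opaque "digest" value exposed to gateways.
func clientVersion(versions map[string]int64) string {
	return fmt.Sprintf("%d.%d", versions["subscriber"], versions["apn"]) // illustrative types
}
```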
We slightly prefer the digest-based pattern, but are open to feedback. Downsides to the versioning solution include
[considered alternative] 3GPP tracking areas
Rather than sending all subscribers in a network to all gateways, we can use the 3GPP concept of tracking areas to map which subscribers should be sent to which gateways. However, this approach is insufficient as we can’t make strong assumptions on how operators set and manage their tracking areas, and there’s no guarantee that a UE won’t travel to a gateway outside their tracking area.
Resilience considerations
Non-serializable subscribers view. We expect quite infrequent writes to the set of subscriber configs, meaning most “get all subscribers” requests will happen to retrieve a serializable view of the subscribers. Additionally, tree-based digest patterns allow sending exactly the set of subscribers that constitute the delta between Orc8r and gateway. So even though a subscriberdb pod’s non-serializable view of the full set of subscribers may occasionally result in slightly mismatched digest trees, the resulting network pressure will be minimal, since only the delta between the two digest trees needs to be transmitted on the subsequent poll. Finally, using the previously-described `subscriber_cache` pattern as a singleton service largely resolves this issue by affording a globally-consistent, if still non-serializable, subscriber view. With this effectively globally-consistent view, only a trivial subset of all requests will see an inconsistent view of the full set of subscribers, removing the potential negative of often-mismatched digests.

Software bugs. Digest-based patterns rely on gateways appropriately storing and including digests in their requests to the southbound subscribers endpoint. Consider two scenarios.
Scenario A. If a misbehaving gateway polls the endpoint in rapid succession, with garbage digests, the Orc8r could easily get DoS’d and/or accrue excessive opex costs. Safety latches will need to be built into the system to impede such misbehaving gateways.
Scenario B. We can also imagine a partial bug in how an Orc8r calculates digests. This would cause similar outcomes, where the Orc8r is attempting to pump exact copies of its subscriber objects out to most or all polling gateways, driving up opex costs and leading to potential self-DoS issues.
In both scenarios, network pressure improvements are either partially lost (scenario A) or up to completely lost (scenario B). The principal prospects for ameliorating these failure modes are
The latter solution doesn’t resolve scenario A, but it does restrict the scope of opex and self-DoS issues in scenario B.
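For scenario A, one plausible safety latch is a per-gateway token bucket, e.g. with golang.org/x/time/rate; the rate and burst below are illustrative:

```go
package latch

import (
	"sync"
	"time"

	"golang.org/x/time/rate"
)

var limiters sync.Map // gateway hardware ID -> *rate.Limiter

// allowPoll refuses digest polls from a gateway that exceeds roughly one
// request per 30 seconds with a small burst allowance.
func allowPoll(gatewayID string) bool {
	l, _ := limiters.LoadOrStore(gatewayID, rate.NewLimiter(rate.Every(30*time.Second), 5))
	return l.(*rate.Limiter).Allow()
}
```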
Scalability calculations
To contextualize the proposed changes in this section, consider the representative operator behaviors presented above, with a single-tenant Orc8r deployment with GA targets of 20k subscribers and 400 gateways over a 7-day period. Under this context, the existing architecture results in (1440 * 7)(400)(20000) = 80 billion subscriber objects sent over the network over a 7-day period — that’s 20 terabytes. For the below calculations, we’ll assume the new default southbound poll interval is 5 minutes, and that each gateway serves a uniform share of the subscribers (i.e. 50 subscribers at each gateway) with no subscriber mobility:
Full tree: (1440/5 * 7)(400)(3mb) overhead + (1 * 7)(400)(500 * 250b) from subs ≈ 2tb

Note that for target use-cases, the proposed multi-round tree pattern in conjunction with the read-through cache architecture invariably keeps 7-day network usage under 250mb for a single-tenant Orc8r at our GA scale targets — an 80,000x improvement even under worst-case assumptions.
Upshot
💬 Objective 4: simplify Orc8r-gateway interface
Overview
Due to the immediate need for pagination of the get-all subscriber endpoints, gateways need to be updated to gracefully handle these updated endpoints by greedily consuming all pages in an update. Additionally, the southbound subscriber interface will need to provide affordances to support the read-through cache functionality.
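As a sketch of that greedy page consumption on the gateway side, with stand-in types shaped per the standard pagination pattern (the real names would come from the .proto definitions):

```go
package gateway

import "context"

// Stand-ins for the generated gRPC stubs.
type SubscriberData struct{ IMSI string }
type ListSubscribersRequest struct {
	PageSize  uint32
	PageToken string
}
type ListSubscribersResponse struct {
	Subscribers   []*SubscriberData
	NextPageToken string
}
type SubscriberdbClient interface {
	ListSubscribers(ctx context.Context, req *ListSubscribersRequest) (*ListSubscribersResponse, error)
}

// fetchAllSubscribers drains every page of the paginated endpoint before
// applying the update, preserving snapshot semantics for the caller.
func fetchAllSubscribers(ctx context.Context, client SubscriberdbClient) ([]*SubscriberData, error) {
	var all []*SubscriberData
	token := ""
	for {
		resp, err := client.ListSubscribers(ctx, &ListSubscribersRequest{
			PageSize:  1000, // tuned so each page stays under the gRPC message cap
			PageToken: token,
		})
		if err != nil {
			return nil, err
		}
		all = append(all, resp.Subscribers...)
		token = resp.NextPageToken
		if token == "" { // an empty token marks the final page
			return all, nil
		}
	}
}
```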
We currently use the streamer pattern to send all subscribers as a data update to the `subscriberdb` stream. We have generic code on the gateway that handles receiving updates from a stream and acting on the received updates. This entire codepath (Orc8r and gateway) would need an update to handle exhaustively reading from the stream, since currently we assume all data updates are contained in a single update batch. This would cause collateral risk to every streaming functionality in the Magma platform: mconfigs, subscribers, policies, rating groups, etc., most of which do not and likely will not need support for multi-page reads.

[short-term] Migrate to unary RPC
Instead, the preferred option is to begin the already-initiated process of moving away from the streamer pattern. It’s a legacy pattern from when we considered support for pushing event-triggered changes down to the gateways. However, with the refocusing toward set-interfaces, and the fact that we have zero Orc8r support whatsoever for event-based stream triggers (streams are just updated on a timed loop), the streamer pattern loses out on its main positives. Instead, we’re left with an unnecessary layer of indirection between Orc8r and gateways, with the added complexity of having to deal with gRPC streams for no specific benefit.
As an aside, the current streamer pattern handling at both Orc8r and gateways doesn’t actually make use of the server-side streaming functionality built into the streamer pattern — we only ever make a single request and then close the stream. So we’re effectively doing “unary” gRPC calls with extra cruft.
That is to say, rather than retrofitting the streamer pattern, we can instead fully move to a pull-based model for polling the subscriber codepath. Benefits of this refactoring approach include:
Additional considerations
Orc8r-controlled poll frequency. We can give Orc8r control over the update frequency by adding an mconfig field to relevant services to set the gateway’s polling frequency, with chosen default value. Then each mconfig update (1 minute) is an opportunity for an Orc8r to tune a gateway’s subscriber update frequency.
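A gateway-side sketch, assuming a new mconfig field along the lines of a SubscriberSyncIntervalSecs (the field and function names are hypothetical):

```go
package poller

import (
	"log"
	"time"
)

// pollLoop re-reads the configured interval on every cycle, so an mconfig
// update (delivered roughly once a minute) takes effect on the next sync.
func pollLoop(intervalSecs func() int, syncOnce func() error) {
	for {
		if err := syncOnce(); err != nil {
			log.Printf("subscriber sync failed: %v", err)
		}
		time.Sleep(time.Duration(intervalSecs()) * time.Second)
	}
}
```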
Upshot
Conclusion: followup tasks
v1.5
- `/lte/networks/{network_id}/gateways/{gateway_id}/subscribers` endpoint, fully removing the need for NMS to make get-all subscriber requests. May need to use the state indexers pattern to implement this endpoint

v1.6
Non-goals