I believe nodes could be optimised to be high-load ready for RPC, API, JSON and HTTP requests in the Gonka project.
Replace verbose keys: "prompt" → "p", "max_tokens" → "mt", "model_id" → "m", "chain-rpc" → "cr", "inference" → "i", especially for non-human (node-to-node) interactions. Cutting 5, 10 or 25 characters per operation adds up over thousands of operations,
and should bring a 70-90% reduction in transfer size and TCP/CPU/processing load, plus better latency/performance. Inference-related changes would be the next step.
[!TIP] json-iterator/go: 2-3x faster than the standard library, non-SIMD, compatible. Suitable for simple cases.
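A minimal sketch of the key-shortening idea, to make the saving measurable. The mapping is taken from the examples in this comment; the sample request, the function names and the "example-model" value are assumptions for illustration only.

```python
import json

# Hypothetical mapping from verbose keys to short aliases (from the examples above).
SHORT_KEYS = {"prompt": "p", "max_tokens": "mt", "model_id": "m"}
LONG_KEYS = {short: long for long, short in SHORT_KEYS.items()}


def shorten(payload: dict) -> dict:
    """Rename known verbose keys to their short aliases; leave other keys as-is."""
    return {SHORT_KEYS.get(key, key): value for key, value in payload.items()}


def expand(payload: dict) -> dict:
    """Inverse of shorten(): restore the verbose keys on the receiving side."""
    return {LONG_KEYS.get(key, key): value for key, value in payload.items()}


def encoded_size(payload: dict) -> int:
    """Size in bytes of the compact JSON encoding of the payload."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


# Made-up sample request, not a real Gonka payload.
request = {"prompt": "Hello", "max_tokens": 64, "model_id": "example-model"}
compact = shorten(request)
saved = encoded_size(request) - encoded_size(compact)
print(f"{encoded_size(request)} bytes -> {encoded_size(compact)} bytes (saved {saved})")
```

The relative saving depends on how large the values are compared to the keys, so running this against real payloads would show whether the estimate above holds in practice.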
-
Why can't we just set multiple URLs in the PoC callback URL and have MLNodes try them one by one? Retry logic is already implemented, so it could be a minor fix.
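Roughly what I mean, as a sketch only. The function name, the injected `send` transport and the example URLs are all hypothetical and not taken from the MLNode code:

```python
from typing import Callable, Sequence


def send_with_failover(
    urls: Sequence[str],
    payload: dict,
    send: Callable[[str, dict], None],
) -> str:
    """Try each callback URL in order; return the first one that accepts the payload."""
    errors = []
    for url in urls:
        try:
            send(url, payload)  # `send` is expected to raise on failure
            return url
        except Exception as exc:
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all callback URLs failed: " + "; ".join(errors))


def fake_send(url: str, payload: dict) -> None:
    """Stand-in transport for the example: the first endpoint is unreachable."""
    if "api-1" in url:
        raise ConnectionError("unreachable")


chosen = send_with_failover(
    ["http://api-1.internal/callback", "http://api-2.internal/callback"],
    {"batch": 1},
    fake_send,
)
print(chosen)
```

The existing retry logic could wrap each individual `send` call, with this loop only moving to the next URL once retries for the current one are exhausted.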
-
High-availability / Fault-tolerance Problem
In the current deployment architecture there is a bottleneck: a single issue with a node or api container can lead to the failure of a multitude of MLNodes.
There is a real need for redundancy in all parts of the system.
The current deployment architecture has essentially a single instance of the node container per api container.
Any failure or lag of the node container of an api container (for any reason, such as overloading, a hardware issue, DDoS, etc.) can easily lead to exclusion and the loss of the full reward per epoch for this validator.
Examples include already observed network events, such as request DDoS and chain-rpc attacks.
Even small lags of the node container of an active validator lead to an inability to sign blocks => the node becomes jailed and might be slashed.
Node Instance problem
In the current default deployment, a single instance of a chain node (node container) is responsible both for processing data from other peers and signing blocks, and for serving /chain-api and /chain-rpc requests from end users.
Essentially, if any of the requests from users of the api container is heavy enough (or if there are a lot of them), it directly affects the performance of processing data from other peers (through any limit such as CPU, disk IO, etc.). This causes lags (the node can't stay in sync with the chain), and the node also doesn't sign blocks in time, which might slow down the whole chain.
This is essentially what happened on the chain during the last months.
Ideally, user requests must never directly hit the instance of the node container which produces and signs blocks. In the same way, they should not be able to affect the instance which is used by api to get the latest data from the chain.
Sentries architecture
There is a well-known solution for hiding validators from direct access (both any APIs and P2P connections with other nodes, to minimize the chance of DDoS) called Sentry Deploy.
Essentially, the idea is to have several read-only (sentry) nodes which communicate with the network and are visible to everyone, but which don't actively participate in producing / signing blocks.
The validator itself, which is responsible for producing and signing blocks, lives in the internal network and has no public IP and no open ports at all. Usually such a node has all APIs and snapshots disabled, prunes everything, and does no indexing. The goal is to achieve optimal performance for such containers.
In the original design, this means that any transaction gossiped to its mempool passes through one of the Sentry nodes first and is rejected in advance if it has any significant issues.
Sentries + Cluster
The Sentry architecture is focused on protecting the validator's mempool and does not take into account that the api container must have access to the chain's up-to-date state and must also not be affected by any external requests.
Node enhancement proposal
Based on this, we propose a cluster deployment architecture which uses the idea of the Sentry node but additionally explicitly separates the instances which can be accessed externally (and whose failures must not affect the whole validator) from the ones which are critical and should never be exposed. We're trying to do this in a way that lets bigger nodes add more instances / redundancy to handle more traffic.
The proposal uses 2 pools of node instances:
1. Public Sentries
This is exactly the same idea as sentry nodes, which are used as a shield to protect validators. They can all keep a short history of snapshots so that disk usage stays relatively low.
Only some of them should have /chain-api and /chain-rpc enabled and be available for user requests (load balancers can redirect a user request to a node that has them enabled).
Snapshots are stored only on such nodes.
2. Private Cluster
The private cluster lives in an internal network and uses all public sentries as persistent peers. Nodes in the private cluster don't have public interfaces and accept requests only from each other, sentry nodes and api container(-s).
One of the nodes in the Private Cluster is the active Validator for this host. It doesn't have direct access to the Consensus Key and signs blocks using a signer (tmkms).
A host can't have multiple Validator nodes at the same time, to avoid double-signing with the same Consensus Private Key. But if the Validator has any technical issues, any other node from the Private Cluster can be promoted to be the Validator. The promotion includes stopping the previous active validator and reconnecting TMKMS (which holds the Consensus Private Key) from the old one to the new one.
Note - The switchover itself is a point of failure. Additional health-checks and automation should be considered.
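As a rough illustration of the kind of automation meant here, the sketch below models the promotion order described above (stop the old validator first, then move the signer). `PrivateCluster`, `promote` and the node names are hypothetical; real code would talk to the containers and to tmkms rather than to an in-memory object.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PrivateCluster:
    """Hypothetical in-memory model of the private cluster (illustration only)."""
    running: Set[str] = field(default_factory=set)  # nodes currently running
    signer_target: Optional[str] = None             # node the signer is attached to
    stuck: Set[str] = field(default_factory=set)    # nodes that fail to stop (simulation)

    def stop(self, node: str) -> None:
        if node not in self.stuck:
            self.running.discard(node)


def promote(cluster: PrivateCluster, candidate: str) -> None:
    """Move the validator role to `candidate`, stopping the old validator first."""
    if candidate not in cluster.running:
        raise ValueError(f"{candidate} is not running")
    old = cluster.signer_target
    if old == candidate:
        return
    if old is not None:
        cluster.stop(old)
        # Re-check instead of trusting the stop call: the proposal requires the
        # previous validator to be stopped before the signer is moved.
        if old in cluster.running:
            raise RuntimeError(f"{old} did not stop; refusing to move the signer")
    cluster.signer_target = candidate


cluster = PrivateCluster(running={"node-a", "node-b"}, signer_target="node-a")
promote(cluster, "node-b")
print(cluster.signer_target, sorted(cluster.running))
```

The point of the re-check is that a failed stop aborts the switchover instead of leaving two candidate validators alive at once.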
All private validators store only a short recent history of states, don't make snapshots, and have aggressive pruning enabled.
The api container gets all its data only from private nodes; they are considered to be up to date as long as at least one sentry node is in sync. If one of the private nodes becomes unavailable, the api container switches to the next available node.
Note - A separate enhancement is needed in the api component to ensure that multiple nodes are used instead of one. (The issue is not only unavailability but also the synchronisation state of nodes. The "catching-up" state of a node does not accurately represent its status, so a check should be performed on the "freshness" of data, e.g. last_block_time > time()-10s.)
Both Public Sentries and the Private Cluster can potentially be re-used if the same owner maintains several independent validators (several consensus keys).
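The freshness check from the note above could look roughly like this. It is a sketch only: the function names and the node names are assumptions, and timestamps are plain Unix seconds.

```python
import time
from typing import Dict, Optional

MAX_BLOCK_AGE_SECONDS = 10  # threshold from the example check in the note


def is_fresh(last_block_time: float, now: Optional[float] = None) -> bool:
    """True if the node's latest block is newer than the allowed age."""
    now = time.time() if now is None else now
    return last_block_time > now - MAX_BLOCK_AGE_SECONDS


def pick_node(last_block_times: Dict[str, float], now: Optional[float] = None) -> Optional[str]:
    """Return the first node (in insertion order) with a fresh latest block, else None."""
    for node, last_block_time in last_block_times.items():
        if is_fresh(last_block_time, now):
            return node
    return None


now = 1_000.0
nodes = {"private-1": now - 45.0, "private-2": now - 3.0}
print(pick_node(nodes, now=now))
```

Here private-1 is reachable but 45 seconds behind, so it is skipped even though it may not report itself as catching up.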
Api Instance problem
Currently, a single api instance is responsible for:
If the instance has any issues due to the volume of inference requests from clients, this can directly affect its participation in PoC and lead to validator exclusion and a missed reward.
In addition - The PoC design currently expects that a single callback endpoint exists. It is also expected that a single instance is used to send PoC / validation artifacts to the network.
API enhancement proposal
There is no hard limitation to having only one api container at a time. The roles of the api container can be separated into MLNode / POC Manager and Public Inference. The idea is to never expose the Manager externally, but to have automatic scaling of the role which handles client inference requests (since requests are stateless and there is no requirement for request "stickiness").
The MLNode / POC Manager component.
Two distinct solutions can be considered:
Solution 1 (Infrastructure level fault resolution)
Introduce multiple containers that reside under the same Virtual IP (single callback endpoint)
Pros:
Cons:
api containers to ensure that no double requests are sent to MLNodes (the PoC mechanism in MLNode is known to refuse double requests)
Solution 2 (Application level fault resolution)
The POC Manager can follow a “Primary / replica” approach and have a “connection pool” defined as callbacks for MLNode. The API nodes select/vote for a “primary” manager that is used to handle PoC requests. In case of failure, api nodes re-elect the “primary” manager to handle requests.
In that case, the MLNodes are free to publish PoC and validation results to any API container from the connection pool.
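The proposal does not specify how the api nodes vote, so the following is only one possible sketch under stated assumptions: each api node votes for the lexicographically smallest manager it currently sees as healthy, and a manager becomes primary only with a strict majority of votes. All names are hypothetical, and this is not a full consensus protocol (no terms, no partition handling).

```python
from collections import Counter
from typing import Dict, Optional, Set


def vote(healthy_view: Set[str]) -> Optional[str]:
    """One api node's vote: the smallest manager name it sees as healthy."""
    return min(healthy_view) if healthy_view else None


def elect_primary(views: Dict[str, Set[str]]) -> Optional[str]:
    """Return the manager chosen by a strict majority of api nodes, else None."""
    ballots = [vote(view) for view in views.values()]
    votes = Counter(ballot for ballot in ballots if ballot is not None)
    if not votes:
        return None
    winner, count = votes.most_common(1)[0]
    return winner if count * 2 > len(views) else None


# Every api node sees all three managers as healthy.
all_up = {name: {"api-1", "api-2", "api-3"} for name in ("api-1", "api-2", "api-3")}
print(elect_primary(all_up))

# api-1 has failed: the remaining nodes re-elect among the survivors.
after_failure = {name: {"api-2", "api-3"} for name in ("api-2", "api-3")}
print(elect_primary(after_failure))
```

Requiring a strict majority means that a split view (for example one vote each for two managers) yields no primary rather than two.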
Pros:
Cons:
Notes on both solutions: