Boot time speed up & metrics #1

Merged
merged 30 commits into from Apr 24, 2023

Conversation

jorgecuesta

@jorgecuesta jorgecuesta commented Mar 31, 2023

High-Level Summary

This update includes changes that allow for faster boot time and better overall performance & maintainability.

  • Mesh bootstrap time has been greatly reduced to almost 0, which improves overall performance and makes it easier to maintain your key files.
  • Mesh now supports hot reloading of keys and chains efficiently by deduplicating full nodes, which allows for faster updates and better overall performance.
  • Code has been split into smaller files to make it easier to navigate and modify, which helps with both development and maintenance.
  • The Relay, Relay Time, and Error metrics now use labels for chain id and name to differentiate between different chains and to help with identifying and categorizing the various metrics.
  • Misc bug fixes such as connection re-use and invalid sessions which should increase the performance of your mesh and full node.
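As a rough illustration of what deduplicating full nodes means, the sketch below groups servicer keys by node URL so that per-node work (health checks, chain checks) runs once per URL instead of once per key. The `ServicerKey` and `FullNode` types and the `groupByNode` function are hypothetical names chosen for this example and are not taken from the mesh code.

```go
package main

import (
	"fmt"
	"sort"
)

// ServicerKey is a hypothetical entry from the keys file: one servicer
// address and the URL of the full node that serves it.
type ServicerKey struct {
	Address string
	NodeURL string
}

// FullNode is a hypothetical deduplicated node: one entry per URL, holding
// every servicer that lives behind that URL.
type FullNode struct {
	URL       string
	Servicers map[string]struct{}
}

// groupByNode collapses keys that share a node URL into a single FullNode.
func groupByNode(keys []ServicerKey) map[string]*FullNode {
	nodes := make(map[string]*FullNode)
	for _, k := range keys {
		n, ok := nodes[k.NodeURL]
		if !ok {
			n = &FullNode{URL: k.NodeURL, Servicers: make(map[string]struct{})}
			nodes[k.NodeURL] = n
		}
		n.Servicers[k.Address] = struct{}{}
	}
	return nodes
}

func main() {
	keys := []ServicerKey{
		{Address: "addr1", NodeURL: "https://node-a.example"},
		{Address: "addr2", NodeURL: "https://node-a.example"},
		{Address: "addr3", NodeURL: "https://node-b.example"},
	}
	nodes := groupByNode(keys)

	// Sort the URLs so the printed order is stable.
	urls := make([]string, 0, len(nodes))
	for u := range nodes {
		urls = append(urls, u)
	}
	sort.Strings(urls)
	for _, u := range urls {
		fmt.Printf("%s serves %d servicer(s)\n", u, len(nodes[u].Servicers))
	}
}
```

With three keys behind two URLs, only two nodes need to be checked at boot, which is the kind of saving the summary above refers to.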

Breaking Changes

These changes require you to update both the Pocket Node and the Mesh Node.

  1. Added /v1/private/mesh/check endpoint
  2. Moved /v1/mesh/health to /v1/private/mesh/health
  3. Removed /v1/private/mesh/servicer
  4. Removed Prometheus metrics using chain id as part of the metric name
  5. Removed authtoken query param from private mesh endpoints in favor of Authorization: <token> header.
    • NOTE: No changes are needed on your side unless you have tooling built on the /v1/private/mesh/* endpoints.
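For tooling that calls the private endpoints, a minimal sketch of the header-based call could look like the following. The token value, the stub server, and its 401 response are assumptions made for this example; only the endpoint path and the `Authorization: <token>` header come from the list above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newPrivateRequest builds a GET for a private mesh endpoint and passes the
// token in the Authorization header instead of an authtoken query parameter.
func newPrivateRequest(baseURL, path, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+path, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", token)
	return req, nil
}

// callStubHealth starts a stand-in server (not the real mesh handler) that
// accepts only the hypothetical token "secret-token" sent as a header, then
// calls it and returns the HTTP status code.
func callStubHealth(token string) (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "secret-token" || r.URL.RawQuery != "" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	req, err := newPrivateRequest(srv.URL, "/v1/private/mesh/health", token)
	if err != nil {
		return 0, err
	}
	resp, err := srv.Client().Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	for _, token := range []string{"secret-token", "wrong-token"} {
		code, err := callStubHealth(token)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		fmt.Printf("token %q -> HTTP %d\n", token, code)
	}
}
```

Because the token never appears in the URL, it does not end up in access logs that record the request line.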

NOTE:

  • You must add the new private endpoints (check) and use the same code/image on both sides or your mesh node will not start.
  • You could remove the /v1/health endpoint from your proxy.

Configuration Changes

  1. Added node_check_interval to configure the rate at which a full node is health-checked.
  2. Added log_chain_request and log_chain_response to avoid unnecessary debug logs.
  3. Added chains_name_map and remote_chains_name_map to enhance exposed metrics with chain name.
  4. Added metrics_moniker to help identify a mesh node instance in metrics queries.
  5. Added metrics_report_interval to configure the rate at which worker metrics are reported.
  6. Added <servicer|chain>_rpc_max_idle_connections
  7. Added <servicer|chain>_rpc_max_conns_per_host
  8. Added <servicer|chain>_rpc_max_idle_conns_per_host
  9. Added chain_drop_connections
  10. Added client_rpc_read_timeout
  11. Added client_rpc_read_header_timeout
  12. Added client_rpc_write_timeout
  13. Added chain_request_path_cleanup
  14. Removed hot_reload_interval in favor of keys_hot_reload_interval and chains_hot_reload_interval as separate values
  15. Renamed worker pool options so that Servicer and Metrics each have their own set
  16. Removed rpc_timeout in favor of client_rpc_timeout and chain_rpc_timeout
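The connection-related options above correspond naturally to fields on Go's `http.Transport`. The sketch below shows one plausible mapping; the `RPCConfig` struct, its field names, and the example values are hypothetical, and how the mesh actually parses and applies these options is not shown in this PR description.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// RPCConfig is a hypothetical holder for one set of options, either the
// servicer_* or the chain_* variant.
type RPCConfig struct {
	MaxIdleConnections  int           // <servicer|chain>_rpc_max_idle_connections
	MaxConnsPerHost     int           // <servicer|chain>_rpc_max_conns_per_host
	MaxIdleConnsPerHost int           // <servicer|chain>_rpc_max_idle_conns_per_host
	Timeout             time.Duration // for example chain_rpc_timeout
}

// newRPCTransport copies the connection limits onto a standard transport.
func newRPCTransport(cfg RPCConfig) *http.Transport {
	return &http.Transport{
		MaxIdleConns:        cfg.MaxIdleConnections,
		MaxConnsPerHost:     cfg.MaxConnsPerHost,
		MaxIdleConnsPerHost: cfg.MaxIdleConnsPerHost,
	}
}

// newRPCClient wraps the transport in a client with an overall timeout.
func newRPCClient(cfg RPCConfig) *http.Client {
	return &http.Client{
		Timeout:   cfg.Timeout,
		Transport: newRPCTransport(cfg),
	}
}

func main() {
	// Illustrative values only, not recommended defaults.
	cfg := RPCConfig{
		MaxIdleConnections:  2500,
		MaxConnsPerHost:     2500,
		MaxIdleConnsPerHost: 2500,
		Timeout:             30 * time.Second,
	}
	client := newRPCClient(cfg)
	tr := newRPCTransport(cfg)
	fmt.Println("timeout:", client.Timeout)
	fmt.Println("idle / per host / idle per host:",
		tr.MaxIdleConns, tr.MaxConnsPerHost, tr.MaxIdleConnsPerHost)
}
```

Keeping separate transports for servicer and chain calls lets each pool be sized for its own traffic pattern.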

Core Changes

New:

  1. New keys structure
  2. Deduplication of nodes (reduce bandwidth usage and faster boot time)
  3. Allow controlling the RPS from mesh to a node (using servicer_max_workers), which is possible because of how workers are used now
  4. Chains check: verify against the Pocket node that the chain IDs in the chains file match the ones on the Pocket node.
  5. Added a Worker for metrics
  6. Chains name maps (local or remote) to enhance metrics
  7. Added an endpoint that lets the mesh node check itself against the servicer node (health, chains, addresses)
  8. Added sanitization of the URL before calling the chain, due to errors observed on chains like Avax/DFK where \t characters were sent in the path.
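To make the URL sanitization item concrete, here is a minimal sketch that strips control characters such as `\t` from a request path before it is forwarded. The `cleanPath` helper and the example path are assumptions for illustration; the actual cleanup rule behind chain_request_path_cleanup may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// cleanPath drops control characters (tab, newline, and similar) from a
// request path and leaves every other character untouched.
func cleanPath(p string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, p)
}

func main() {
	fmt.Printf("%q\n", cleanPath("/ext/bc/C/rpc\t")) // prints "/ext/bc/C/rpc"
}
```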

Fix:

  1. Invalid sessions error
  2. HTTP connection drops caused by not reading the response body (the fix decreases CPU utilization and bandwidth)
  3. Handle a few cases where execution kept going after an error instead of interrupting the flow, which invalidated a relay/session.
  4. Update chains using /v1/private/mesh/updatechains endpoint
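The connection-drop fix relies on standard Go HTTP client behavior: a keep-alive connection can only go back to the idle pool once the response body has been read to the end and closed. The sketch below demonstrates that pattern against a local test server; `drainAndClose`, `get`, and `secondRequestReused` are hypothetical helpers written for this example, not functions from the mesh code.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// drainAndClose reads any unread bytes and closes the body so the
// underlying keep-alive connection can return to the idle pool.
func drainAndClose(body io.ReadCloser) {
	_, _ = io.Copy(io.Discard, body)
	_ = body.Close()
}

// get performs one GET, ignores the payload (as an error path might), and
// reports whether the request ran on a reused connection.
func get(client *http.Client, url string) (reused bool, err error) {
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) { reused = info.Reused },
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	drainAndClose(resp.Body) // done even though the payload is not needed
	return reused, nil
}

// secondRequestReused sends two requests to a stub server that always
// answers with an error status and reports whether the second request
// reused the first connection.
func secondRequestReused() (bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusInternalServerError)
		fmt.Fprint(w, "upstream error")
	}))
	defer srv.Close()

	client := srv.Client()
	if _, err := get(client, srv.URL); err != nil {
		return false, err
	}
	return get(client, srv.URL)
}

func main() {
	reused, err := secondRequestReused()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("second request reused the connection:", reused)
}
```

Skipping the drain on error responses is an easy mistake to make, since the caller usually has no use for the body in that case.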

Rework:

  1. Relay, Relay Time, and Error metrics now use labels for chain id and name
  2. Split Keys and Chains hot reload interval to allow them to work independently.
  3. Moved the authtoken query parameter used to call private methods into the HTTP Authorization header, to avoid leaking it in logs.
  4. Split the code into files to help development and readability.
  5. Bootstrap time reduced to almost 0
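To show what the metrics rework changes for anyone writing queries, the sketch below renders a series identifier in both styles using only the standard library. The metric name `relay_count`, the label names `chain_id` and `chain_name`, and the sample values are assumptions for this example; the real metric and label names are defined in the mesh code and its metrics docs.

```go
package main

import "fmt"

// legacySeries mimics the removed style, where the chain id is part of the
// metric name and every chain produces a differently named metric.
func legacySeries(metric, chainID string) string {
	return fmt.Sprintf("%s_%s", metric, chainID)
}

// labeledSeries mimics the new style: one metric name, with the chain id
// and chain name carried as labels. %q is close enough to Prometheus label
// quoting for simple ASCII values like the ones used here.
func labeledSeries(metric, chainID, chainName string) string {
	return fmt.Sprintf("%s{chain_id=%q,chain_name=%q}", metric, chainID, chainName)
}

func main() {
	fmt.Println(legacySeries("relay_count", "0021"))
	fmt.Println(labeledSeries("relay_count", "0021", "eth-mainnet"))
}
```

With labels, a single query can aggregate or filter across chains, instead of needing one query per metric name.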

Dependencies:

  1. Added xsync for better concurrent-access performance than the native alternatives.
  2. Added gojsonschema to validate the keys and chains name map formats.
  3. Added golang-set to manage sets of values faster than with arrays, in a concurrency-safe way
  4. Bump version github.com/alitto/pond from v1.8.1 to v1.8.3
  5. Bump version github.com/prometheus/client_golang from v1.11.0 to v1.11.1

Bump Version

New version: RC-0.3.0

Docs

  1. Updated mesh.md
  2. Added missing links to dockerhub
  3. Added Grafana dashboard to consume the new metrics.
  4. Updated rpc-spec.yaml

Added different key format.
Refactor connectivity checks.
Refactor node/servicer internal structure of mesh to reduce amount of worker/cron instances.
Refactor chains/keys reload.
@jorgecuesta jorgecuesta self-assigned this Mar 31, 2023
@jorgecuesta jorgecuesta added the enhancement New feature or request label Mar 31, 2023
Collaborator

@nodiesBlade nodiesBlade left a comment

Ah. I don't have many insights to add. File loading/configuration can only be done so concisely. As long as it works!

Added FullNode worker dynamic resize on servicers change.
Updated servicers reload to only run the modification on maps when there is something new/removed.
@jorgecuesta
Author

@PoktBlade more changes on this, please check the latest commit

…e and better readability of the code without so many casts.

Refactor fullNode.Servicer to be a map instead of a slice.
Enhance a bit more the logs and bootstrap time information.
Added metrics config support.
Refactor code to split in files.
Bump pond version to 1.8.3 (patch).
Clean up the code.
@jorgecuesta jorgecuesta changed the title from "Initial rework to speed up bootstrap times of mesh client." to "Boot time speed up & metrics" Apr 5, 2023
Update config to handle rpc timeout for different things like chains, client and pocket node calls with a different value.
Ensure that http response body is read even on errored request to reuse connections.
Enhanced chains reload logs.
Enhanced startup logs.
… so many edge cases and possible infinite goroutine spams.

Added name property to nodes as optional key, if not set use the hostname of the node url.
Added minWorker, maxWorker, maxCapacity to prometheus metrics collectors.
Refactor minWorker, maxWorker and maxCapacity option in config.
Bump default to a more real world value.
Updated docs.
…pact on mesh code, but help node runners to keep internal track.
…/queries on prometheus.

Added chains name map so those metrics could contain the chain name you wish.
ChainsNameMap could work with a local file or remote endpoint (GET)
… error log. Does not affect the code but is unnecessary.
Added status_type and status_code labels to error metrics.
Added internal, notify and chain error metrics.
Added metrics docs and basic geo-mesh grafana dashboard.
…ould be leaked on logs.

Moved Grafana Dashboard to a file to easily compare/copy from raw github files.
Updated mesh.md
Enhance docs based on blade suggestions.
@nodiesBlade
Collaborator

nodiesBlade commented Apr 24, 2023

High Level Summary

This update includes changes that allow for faster boot time and better overall performance & maintainability.

  • Mesh bootstrap time has been greatly reduced to almost 0, which improves overall performance and makes it easier to maintain your key files.
  • Mesh now supports hot reloading of keys and chains efficiently by deduplicating full nodes, which allows for faster updates and better overall performance.
  • Code has been split into smaller files to make it easier to navigate and modify, which helps with both development and maintenance.
  • The Relay, Relay Time, and Error metrics now use labels for chain id and name to differentiate between different chains and to help with identifying and categorizing the various metrics.
  • Misc bug fixes such as connection re-use and invalid sessions which should increase the performance of your mesh and full node.

Breaking Changes

These changes require you to update both the Pocket Node and the Mesh Node.

  1. Added /v1/private/mesh/check endpoint
  2. Moved /v1/mesh/health to /v1/private/mesh/health
  3. Removed /v1/private/mesh/servicer
  4. Removed Prometheus metrics using chain id as part of the metric name

NOTE: You must add the new private endpoints (check & health), or your mesh node will not start.

Configuration Changes

  1. Added node_check_interval to configure the rate at which a full node is health-checked.
  2. Added log_chain_request and log_chain_response to avoid unnecessary debug logs.
  3. Added chains_name_map and remote_chains_name_map to enhance exposed metrics with chain name.
  4. Added metrics_moniker to help identify a mesh node instance in metrics queries.
  5. Added metrics_report_interval to configure the rate at which worker metrics are reported.
  6. Added support for Servicer and Chain http.Transport options:
    • <servicer|chain>_rpc_max_idle_connections
    • <servicer|chain>_rpc_max_conns_per_host
    • <servicer|chain>_rpc_max_idle_conns_per_host
  7. Removed hot_reload_interval in favor of keys_hot_reload_interval and chains_hot_reload_interval as separate values
  8. Renamed worker pool options so that Servicer and Metrics each have their own set
  9. Removed rpc_timeout in favor of client_rpc_timeout and chain_rpc_timeout

Core Changes

New:

  1. New keys structure
  2. Deduplication of nodes (reduce bandwidth usage and faster boot time)
  3. Allow controlling the RPS from mesh to a node (using servicer_max_workers), which is possible because of how workers are used now
  4. Chains check: verify against the Pocket node that the chain IDs in the chains file match the ones on the Pocket node.
  5. Added a Worker for metrics
  6. Chains name maps (local or remote) to enhance metrics
  7. Added an endpoint that lets the mesh node check itself against the servicer node (health, chains, addresses)

Fix:

  1. Invalid sessions error
  2. HTTP connection drops caused by not reading the response body (the fix decreases CPU utilization and bandwidth)
  3. Handle a few cases where execution kept going after an error instead of interrupting the flow, which invalidated a relay/session.
  4. Update chains using /v1/private/mesh/updatechains endpoint

Rework:

  1. Relay, Relay Time, and Error metrics now use labels for chain id and name
  2. Keys hot reload
  3. Chains hot reload
  4. Moved the authtoken query parameter used to call private methods into the HTTP Authorization header, to avoid leaking it in logs.
  5. Split the code into files to help development and readability.
  6. Bootstrap time reduced to almost 0

Dependencies:

  1. Added xsync for better concurrent-access performance than the native alternatives.
  2. Added gojsonschema to validate the keys and chains name map formats.
  3. Added golang-set to manage sets of values faster than with arrays, in a concurrency-safe way
  4. Bump version github.com/alitto/pond from v1.8.1 to v1.8.3
  5. Bump version github.com/prometheus/client_golang from v1.11.0 to v1.11.1

Bump Version

New version: RC-0.3.0

Docs

  1. Updated mesh.md
  2. Added missing links to dockerhub
  3. Added Grafana dashboard to consume the new metrics.
  4. Updated rpc-spec.yaml

--

I made some changes to the release notes, could you elaborate more on the rework of:

2. Keys hot reload
3. Chains hot reload

Collaborator

@nodiesBlade nodiesBlade left a comment

LGTM.

I think this is an iterative step, architecturally and code-wise, to improve the quality of Mesh. This is obviously not its final form, and there are plenty more code quality & architectural improvements down the road.

amazing work @jorgecuesta , I imagine a future where the entire network is leveraging mesh in some form or way

@jorgecuesta jorgecuesta merged commit def24c3 into geo-mesh Apr 24, 2023
@jorgecuesta jorgecuesta deleted the initial-enhancement branch April 24, 2023 21:30
Labels
enhancement New feature or request
Projects
None yet
2 participants