This is a meta-issue about useful metrics in bifrost-gateway.
We may ship only a subset of the below for the project Rhea.
Overview
The go-libipfs/gateway library will provide some visibility into incoming requests (1),
but we also need metrics that track the performance of the Saturn client (block provider) (2)
and other internals, such as resolution costs for different content path types and any in-memory caches we may add (3).
```mermaid
graph LR
    A(((fa:fa-person HTTP<br>clients)))
    B[bifrost-gateway]
    N[[fa:fa-hive bifrost-infra:<br>HTTP load-balancers<br> nginx, TLS termination]]
    S(((saturn.pl<br>CDN)))
    M0[( 0 <br>NGINX/LB<br/>LOGS&METRICS)]
    M1[( 1 <br>HTTP<br/>METRICS:<br/> ipfs_http_*)]
    M2[( 2 <br>BLOCK<br/>PROVIDER<br/>METRICS <br/>???)]
    M3[( 3 <br>INTERNAL<br/>METRICS<br/>???)]

    A -->| Accept: .. <br>?format=<br>Host:| N
    N --> M1 --> B
    N .-> M0
    B --> M2 ---> S
    B .-> M3
```
(0) metrics are tracked before bifrost-gateway and are out of scope here.
Proposed metrics [WIP]
Below is a snapshot / brain dump. It is not final yet; we want to have an internal analysis/discussion before we start.
For (1)
Per request type
Duration Histogram per request type (a sketch of how these histograms could be registered is included after this list)
We want a global variant, and one per namespace (/ipfs/ or /ipns/)
See the Appendix below for an example of what a histogram looks like
Why?
We need to measure each request type, informed by ?format= and the Accept header, because:
They have different complexity and will have different latency costs
We want to see which ones are most popular; comparing _sum from the histograms will show us the % distribution
We need to measure /ipfs/ and /ipns/ separately to see the impact of the additional resolution step (IPNS or DNSLink).
Response Size Histogram per request type
We want a global variant, and one per namespace (/ipfs/ or /ipns/)
Why?
Understanding the average response size allows us to correctly interpret the Duration. Without it, the Duration of a UnixFS response does not tell us whether the file was big or our stack was slow.
Count requests sent with Cache-Control: only-if-cached
Open question (can be answered later, after we see initial counts): should we exclude these requests from the totals above? My initial suggestion is to exclude them; if they become popular, they will skew the numbers, as a request for a 4GB file will look "insanely fast".
Count 200 vs 2xx vs 3xx vs 400 vs 500 response codes
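A minimal sketch (assuming prometheus/client_golang; the metric names, label names, and bucket boundaries are illustrative assumptions, not the actual go-libipfs/gateway or bifrost-gateway identifiers) of how the per-request-type duration and response size histograms could be registered:

```go
// Hypothetical sketch, not the real gateway code: labeled histograms for
// request duration and response size, partitioned by namespace and request type.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "ipfs",
			Subsystem: "http",
			Name:      "request_duration_seconds",
			Help:      "Time to serve a gateway request.",
			Buckets:   []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
		},
		// namespace is "ipfs" or "ipns"; request_type is derived from
		// ?format= / Accept (e.g. "unixfs", "raw", "car", "ipns-record").
		[]string{"namespace", "request_type"},
	)

	responseSize = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "ipfs",
			Subsystem: "http",
			Name:      "response_size_bytes",
			Help:      "Size of the response body returned to the client.",
			Buckets:   prometheus.ExponentialBuckets(256, 4, 10), // 256 B .. ~64 MiB
		},
		[]string{"namespace", "request_type"},
	)
)

func init() {
	prometheus.MustRegister(requestDuration, responseSize)
}

// ObserveRequest records one finished request.
func ObserveRequest(ns, reqType string, start time.Time, bytesWritten int64) {
	requestDuration.WithLabelValues(ns, reqType).Observe(time.Since(start).Seconds())
	responseSize.WithLabelValues(ns, reqType).Observe(float64(bytesWritten))
}
```

With namespace and request_type as labels, the global view is just the sum over all label values, so we get the per-namespace split without duplicating metric definitions.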
For (2)
Initially, we will only request raw blocks (application/vnd.ipld.raw) from Saturn (a sketch of this instrumentation follows this list):
Duration Histogram for block request
Response Size Histogram for block request
Count 200 vs non-200 response codes
TBD: Future (fancy application/vnd.ipld.car)
All requests will be for resolved /ipfs/ paths
We will most likely want to track:
Duration and response size per original request type (histograms)
If we support sub-paths, then we will also need to track Requested Content Path length (histogram)
TBD: if we put some sort of block cache in front of it, track HIT/MISS, probably per request type
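A hedged sketch of the block-provider instrumentation described in (2), assuming a plain HTTP GET for application/vnd.ipld.raw; the endpoint shape, metric names, and buckets are assumptions (the real code will likely go through a dedicated Saturn client library):

```go
// Hypothetical sketch: wrap each raw block fetch and record its duration,
// size, and whether the response was 200 vs non-200.
package blockprovider

import (
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	blockGetDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "bifrost_block_get_duration_seconds",
		Help:    "Time to GET an entire raw block from the block provider.",
		Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
	})
	blockGetSize = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "bifrost_block_get_size_bytes",
		Help:    "Size of raw blocks returned by the block provider.",
		Buckets: prometheus.ExponentialBuckets(1024, 2, 12), // 1 KiB .. 2 MiB
	})
	blockGetStatus = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "bifrost_block_get_responses_total",
		Help: "Block provider responses, partitioned into 200 vs non-200.",
	}, []string{"status"})
)

func init() {
	prometheus.MustRegister(blockGetDuration, blockGetSize, blockGetStatus)
}

// GetBlock fetches one raw block and records the metrics above.
func GetBlock(client *http.Client, endpoint, cid string) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, endpoint+"/ipfs/"+cid, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.ipld.raw")

	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	status := "non-200"
	if resp.StatusCode == http.StatusOK {
		status = "200"
	}
	blockGetStatus.WithLabelValues(status).Inc()

	body, err := io.ReadAll(resp.Body)
	blockGetDuration.Observe(time.Since(start).Seconds())
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("block provider returned %d", resp.StatusCode)
	}
	blockGetSize.Observe(float64(len(body)))
	return body, nil
}
```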
For (3)
A place for additional internal metrics that give us more visibility into the details, if we ever need to zoom in (a sketch of one such histogram follows this list).
Duration Histogram for /ipfs resolution
Why? It allows us to eyeball when resolution becomes the source of general slowness / a regression in TTFB
Requested Content Path length Histogram for /ipfs
Why? We want to know the % of direct requests for a bare CID vs. requests that include a longer sub-path.
Duration Histograms for /ipns/ resolution of DNSLink and IPNS Records, both for a single lookup and for recursive resolution until an /ipfs/ path is reached
Why?
bifrost-gateway will be delegating resolution to a remote HTTP endpoint
Both can be recursive, so the metrics will be skewed unless we measure a single lookup and recursive resolution separately
We want to be able to see which ones are most popular, and how often recursive values are present. Comparing _sum from histograms will allow us to see % distribution.
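A minimal sketch of the /ipns/ resolution histogram described above, using labels to separate DNSLink from IPNS Record lookups and single from recursive resolution; all names here are illustrative assumptions, not existing bifrost-gateway identifiers:

```go
// Hypothetical sketch: one histogram for /ipns/ resolution, labeled by the
// kind of name being resolved and by single vs recursive mode.
package resolver

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var resolveDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "bifrost_ipns_resolve_duration_seconds",
	Help:    "Time to resolve an /ipns/ name towards an /ipfs/ path.",
	Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
}, []string{
	"kind", // "dnslink" or "ipns-record"
	"mode", // "single" (one lookup) or "recursive" (until /ipfs/ is reached)
})

func init() {
	prometheus.MustRegister(resolveDuration)
}

// timeResolution is a small helper: call `defer timeResolution(kind, mode)()`
// at the start of a resolution step to record how long it took.
func timeResolution(kind, mode string) func() {
	start := time.Now()
	return func() {
		resolveDuration.WithLabelValues(kind, mode).Observe(time.Since(start).Seconds())
	}
}
```

A resolver would wrap each individual lookup with mode "single" and the whole walk to an /ipfs/ path with mode "recursive", so the two distributions can be compared directly.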
Appendix: what histograms from go-libipfs/gateway look like
When I say "histogram", I mean the _sum and _bucket series we use in Kubo's /debug/metrics/prometheus:
```
# HELP ipfs_http_gw_raw_block_get_duration_seconds The time to GET an entire raw Block from the gateway.
# TYPE ipfs_http_gw_raw_block_get_duration_seconds histogram
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.05"} 927
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.1"} 984
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.25"} 1062
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.5"} 1067
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="1"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="2"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="5"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="10"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="30"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="60"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="+Inf"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_sum{gateway="ipfs"} 19.696413685999993
ipfs_http_gw_raw_block_get_duration_seconds_count{gateway="ipfs"} 1068
```
We can change the bucket distribution if that gives us better data, but it should be done on both ends.