Add metrics specific to bifrost-gateway setup #12

lidel opened this issue Feb 6, 2023 · 0 comments
This is a meta-issue about useful metrics in bifrost-gateway.
We may ship only a subset of the items below for project Rhea.

Overview

The go-libipfs/gateway library will provide some visibility into incoming requests (1),
but we need to add metrics to track the performance of the Saturn client (block provider) (2),
and other internals, such as resolution costs for different content path types and any in-memory caches we may add (3).

```mermaid
graph LR
    A(((fa:fa-person HTTP</br>clients)))
    B[bifrost-gateway]
    N[[fa:fa-hive bifrost-infra:<br>HTTP load-balancers<br> nginx, TLS termination]]
    S(((saturn.pl<br>CDN)))
    M0[( 0 <br>NGINX/LB<br/>LOGS&METRICS)]
    M1[( 1 <br>HTTP<br/>METRICS:<br/> ipfs_http_*)]
    M2[( 2 <br>BLOCK<br/>PROVIDER<br/>METRICS <br/>???)]
    M3[( 3 <br>INTERNAL<br/>METRICS<br/>???)]

    A -->| Accept: .. <br>?format=<br>Host:| N

    N --> M1 --> B
    N .-> M0

    B --> M2 ---> S
    B .-> M3
```

(0) covers metrics tracked before requests reach bifrost-gateway and is out of scope here.

Proposed metrics [WIP]

Below is a snapshot / brain dump. It is not final yet; we want to do some internal analysis/discussion before we start.

For (1)

  • Per request type (a minimal Go sketch follows this list)
    • Duration Histogram per request type
      • We want a global variant, and one per namespace (/ipfs/ or /ipns/)
      • See the Appendix below for an example of what a histogram looks like
      • Why?
        • We need to measure each request type (informed by ?format= and the Accept header) because:
          • They have different complexity involved and will have different latency costs
          • We want to be able to see which ones are most popular; comparing _sum from histograms will allow us to see the % distribution
        • We need to measure /ipfs/ and /ipns/ separately to see the impact the additional resolution step (IPNS or DNSLink) has.
    • Response Size Histogram per request type
      • We want a global variant, and one per namespace (/ipfs/ or /ipns/)
      • Why?
        • Understanding the average response size allows us to correctly interpret the Duration. Without it, the Duration of a unixfs response does not tell us whether the file was big or our stack was slow.
  • Count GET vs HEAD requests
    • For each, count requests with Cache-Control: only-if-cached
      • Open question (can be answered later, after we see initial counts): should we exclude these requests from the totals? My initial suggestion is to exclude them. If they become popular, they will skew the numbers, as a request for a 4GB file will be "insanely fast".
  • Count 200 vs 2xx vs 3xx vs 400 vs 500 response codes
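
A minimal sketch of how the per-request-type metrics above could be registered with the Prometheus Go client. The metric names, label names, and bucket boundaries here are illustrative placeholders, not a final proposal:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric and label names; the shape (histograms keyed by
// namespace and request type, plus request counters) follows the list above.
var (
	requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "ipfs_http_gw_request_duration_seconds",
		Help:    "Time to serve a gateway request.",
		Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
	}, []string{"namespace", "request_type"}) // namespace: ipfs|ipns; request_type from ?format= / Accept

	responseSize = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "ipfs_http_gw_response_size_bytes",
		Help:    "Size of the response body.",
		Buckets: prometheus.ExponentialBuckets(256, 4, 10),
	}, []string{"namespace", "request_type"})

	requestCount = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ipfs_http_gw_requests_total",
		Help: "Requests by HTTP method, status class, and only-if-cached flag.",
	}, []string{"method", "status_class", "only_if_cached"})
)

func init() {
	prometheus.MustRegister(requestDuration, responseSize, requestCount)
}
```

Keeping only-if-cached as a label (rather than excluding those requests up front) would let us answer the open question above from the data before deciding.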

For (2)

  • Initially, we will only request raw blocks (application/vnd.ipld.raw) from Saturn (a minimal Go sketch of instrumenting these requests follows this list):

    • Duration Histogram for block request
    • Response Size Histogram for block request
    • Count 200 vs non-200 response codes
  • TBD: Future (fancy application/vnd.ipld.car)

    • All requests will be for resolved /ipfs/
    • We will most likely want to track:
      • Duration and response size per original request type (histograms)
      • If we support sub-paths, then we will also need to track Requested Content Path length (histogram)
  • TBD: if we put some sort of block cache in front of it, track HIT/MISS, probably per request type
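
A minimal sketch of instrumenting a single raw-block fetch, assuming a plain net/http client; fetchBlock, the metric names, and the status labels are hypothetical, and only the measurement pattern (duration and size histograms plus a 200 vs non-200 counter) mirrors the list above:

```go
package saturn

import (
	"context"
	"io"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical metric names; only the shape follows the list above.
var (
	blockGetDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "ipfs_http_blockstore_get_duration_seconds",
		Help: "Time to GET a raw block from the block provider.",
	})
	blockGetSize = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "ipfs_http_blockstore_get_size_bytes",
		Help: "Size of raw blocks returned by the block provider.",
	})
	blockGetStatus = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ipfs_http_blockstore_get_status_total",
		Help: "Block provider responses by outcome (200, non-200, error).",
	}, []string{"status"})
)

func init() {
	prometheus.MustRegister(blockGetDuration, blockGetSize, blockGetStatus)
}

// fetchBlock is a hypothetical helper that fetches one raw block
// (application/vnd.ipld.raw) over HTTP and records the metrics.
func fetchBlock(ctx context.Context, client *http.Client, url string) ([]byte, error) {
	start := time.Now()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.ipld.raw")

	resp, err := client.Do(req)
	if err != nil {
		blockGetStatus.WithLabelValues("error").Inc()
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	blockGetDuration.Observe(time.Since(start).Seconds())
	if resp.StatusCode == http.StatusOK {
		blockGetStatus.WithLabelValues("200").Inc()
		blockGetSize.Observe(float64(len(body)))
	} else {
		blockGetStatus.WithLabelValues("non-200").Inc()
	}
	return body, err
}
```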

For (3)

Place for additional internal metrics to give us more visibility into details, if we ever need to zoom in.

  • Duration Histogram for /ipfs resolution (a minimal Go sketch follows this list)
    • Why? Allows us to eyeball when resolution becomes the source of general slowness / a regression in TTFB
  • Requested Content Path length Histogram for /ipfs
    • Why? We want to know the % of direct requests for a CID vs requests for longer content paths
  • Duration Histograms for /ipns resolutions of DNSLink and IPNS Records, both for a single lookup and for recursive resolution until /ipfs/ is hit
    • Why?
      • bifrost-gateway will be delegating resolution to a remote HTTP endpoint
      • Both can be recursive, so the metrics will be skewed unless we measure both a single lookup and the full recursive resolution
      • We want to be able to see which ones are most popular, and how often recursive values are present. Comparing _sum from histograms will allow us to see the % distribution.
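
A minimal sketch of how the /ipns/ resolution timings could be split by record type and by single vs recursive lookups; resolveFn, timedResolve, and the metric/label names are hypothetical:

```go
package resolver

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical metric; labels distinguish DNSLink vs IPNS Record and a
// single lookup vs full recursion until an /ipfs/ path is reached.
var resolveDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "ipfs_http_gw_name_resolve_duration_seconds",
	Help: "Time spent resolving /ipns/ content paths.",
}, []string{"kind", "mode"}) // kind: dnslink|ipns; mode: single|recursive

func init() {
	prometheus.MustRegister(resolveDuration)
}

// resolveFn stands in for whatever call performs one resolution step
// (or the full recursion) against the remote resolver endpoint.
type resolveFn func(ctx context.Context, name string) (string, error)

// timedResolve wraps a resolver call and records its duration under the
// given kind ("dnslink" or "ipns") and mode ("single" or "recursive").
func timedResolve(ctx context.Context, kind, mode, name string, resolve resolveFn) (string, error) {
	start := time.Now()
	resolved, err := resolve(ctx, name)
	resolveDuration.WithLabelValues(kind, mode).Observe(time.Since(start).Seconds())
	return resolved, err
}
```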

Appendix: what a histogram from go-libipfs/gateway looks like

When I say "histogram", I mean the _sum and _bucket series we use in Kubo's /debug/metrics/prometheus:

```
# HELP ipfs_http_gw_raw_block_get_duration_seconds The time to GET an entire raw Block from the gateway.
# TYPE ipfs_http_gw_raw_block_get_duration_seconds histogram
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.05"} 927
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.1"} 984
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.25"} 1062
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.5"} 1067
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="1"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="2"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="5"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="10"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="30"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="60"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="+Inf"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_sum{gateway="ipfs"} 19.696413685999993
ipfs_http_gw_raw_block_get_duration_seconds_count{gateway="ipfs"} 1068
```

We can change the bucket distribution if that gives us better data, but it should be done on both ends.
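
For reference, a sketch of how a matching histogram could be declared on the bifrost-gateway side; the bucket boundaries below simply mirror the Kubo example above and are not a final choice:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Keeping the same bucket boundaries on both ends keeps the histograms
// comparable; these values mirror the Kubo example above.
var rawBlockGetDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace:   "ipfs",
	Subsystem:   "http_gw",
	Name:        "raw_block_get_duration_seconds",
	Help:        "The time to GET an entire raw Block from the gateway.",
	ConstLabels: prometheus.Labels{"gateway": "ipfs"},
	Buckets:     []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
})
```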
