[RFC] [Remote Store] /_remotestore/stats API and _nodes/stats API enhancements for observability on Remote Translog Store upload operations #8311

Closed
BhumikaSaini-Amazon opened this issue Jun 28, 2023 · 0 comments
Labels
enhancement Enhancement or improvement to existing feature or request Storage:Durability Issues and PRs related to the durability framework Storage Issues and PRs relating to data and metadata storage


BhumikaSaini-Amazon commented Jun 28, 2023

Note: This RFC will be updated to incorporate feedback as received in the community discussion below


Context

Is your feature request related to a problem? Please describe.

Aligning with #6789, we should be able to query statistics for Remote Translog Store (RTS)-related upload operations.

Describe the solution you'd like
This RFC proposes new statistics for observability of the RTS upload flow. To support them, it also proposes changes to the existing /_remotestore/stats API contract.


Changes in the existing /_remotestore/stats API contract

  1. The stats related to Remote Segment Store (RSS) and Remote Translog Store (RTS) would be tracked under distinct, new keys named segments and translog.
  2. New keys named upload and download will be introduced under the segments and the translog keys. These will track the stats related to the upload and download flows respectively. Flow-agnostic stats, if any, pertaining to RSS and RTS would be introduced directly under the segments and translog keys respectively.
  3. As a consequence of point 1 and point 2 above, the existing stats for RSS upload flow will be moved under the segments.upload level. New stats for the RSS download flow would be introduced under the segments.download level.
  4. As a consequence of point 1 and point 2 above, the new stats for the RTS upload flow will be introduced under the translog.upload level. New stats for the RTS download flow would be introduced under the translog.download level.
  5. The RTS stats for upload flow, download flow, as well as any flow-agnostic stats would inherently be only for the primary copy of a given shard.
  6. The RSS upload stats would inherently be only for the primary copy of a given shard.
  7. The RSS download stats would have a breakdown of download stats per replica shard copy.
  8. As a consequence of point 5, point 6, and point 7 above, when queried at the index level on a given node, stats related to the following will only be returned for the shards for which the node is the primary:
    a. RSS upload flow
    b. All RTS stats
  9. As a consequence of point 5, point 6, and point 7 above, when queried at the index level on a given node, stats related to the following will be returned for all shards of the index on the node:
    a. RSS download flow
  10. If the queried index is not RTS-enabled, the translog object will not be returned. Only the segments object and the relevant metadata (i.e. the shard_id) will be returned.
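
As a rough illustration of points 1 to 4 and point 10, the sketch below re-nests a flat, RSS-only stats object into the proposed layout. The function name `to_proposed_layout` and the shape of the flat input are assumptions made for this example; they are not part of the RFC or of any OpenSearch client.

```python
def to_proposed_layout(legacy_shard_stats, translog_upload=None):
    """Re-nest flat RSS upload stats under segments.upload (hypothetical helper).

    legacy_shard_stats: flat dict with "shard_id" plus RSS upload stats (assumed shape).
    translog_upload: dict of RTS upload stats, or None when the index is not RTS-enabled.
    """
    # Everything except the shard metadata is treated as an RSS upload stat.
    upload = {k: v for k, v in legacy_shard_stats.items() if k != "shard_id"}
    shaped = {
        "shard_id": legacy_shard_stats["shard_id"],
        "segments": {"upload": upload, "download": []},
    }
    # Point 10: the translog object is omitted entirely for RTS-disabled indices.
    if translog_upload is not None:
        shaped["translog"] = {"upload": translog_upload, "download": []}
    return shaped


legacy = {"shard_id": "[my-index-1][0]", "refresh_lag": 1, "bytes_lag": 0}
rss_only = to_proposed_layout(legacy)
with_rts = to_proposed_layout(legacy, translog_upload={"lag": 2})
```

Here `rss_only` carries only the segments object and the shard_id, while `with_rts` additionally carries translog.upload.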

Statistics to be introduced for RTS uploads

Visibility on local vs. RTS diff

  1. lag
    Represents the number of translog operations not persisted to RTS. This would be relevant for async translog durability.

  2. last_upload_timestamp
    Represents the epoch timestamp of the last successful RTS upload. A failed upload does not update this value.
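
One possible way to maintain these two values is sketched below. The class and method names are hypothetical, and computing lag as "operations written locally minus operations persisted remotely" is an assumption based on the description above, not the actual implementation.

```python
class TranslogUploadVisibility:
    """Hypothetical tracker for lag and last_upload_timestamp."""

    def __init__(self):
        self.local_ops = 0              # translog operations written locally
        self.uploaded_ops = 0           # operations known to be persisted to RTS
        self.last_upload_timestamp = 0  # epoch timestamp of last successful upload

    def on_local_operations(self, count=1):
        self.local_ops += count

    def on_upload_finished(self, ops_in_upload, succeeded, now_epoch):
        # Only a successful upload reduces the lag or moves the timestamp.
        if succeeded:
            self.uploaded_ops += ops_in_upload
            self.last_upload_timestamp = now_epoch

    @property
    def lag(self):
        return self.local_ops - self.uploaded_ops


visibility = TranslogUploadVisibility()
visibility.on_local_operations(5)
visibility.on_upload_finished(ops_in_upload=3, succeeded=True, now_epoch=100)
visibility.on_upload_finished(ops_in_upload=2, succeeded=False, now_epoch=200)
```

After the failed second upload, lag stays at 2 and last_upload_timestamp stays at 100.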

Totals

  1. total_uploads
    Represents the total number of RTS uploads. Eligible sub-fields (based on operation status): started, succeeded, failed.

  2. total_uploads_in_bytes
    Represents the total number of bytes uploaded to the RTS. Eligible sub-fields (based on operation status): started, succeeded, failed.

  3. total_upload_time_in_millis
    Represents the total time spent on RTS uploads.
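
The totals could be accumulated along the lines of the sketch below. All class and method names are hypothetical. The sketch also assumes that total_upload_time_in_millis counts time spent on every finished upload, whether it succeeded or failed; the RFC does not specify this.

```python
from dataclasses import dataclass, field


@dataclass
class StatusCounter:
    """Counter with the started / succeeded / failed sub-fields."""
    started: int = 0
    succeeded: int = 0
    failed: int = 0


@dataclass
class TranslogUploadTotals:
    total_uploads: StatusCounter = field(default_factory=StatusCounter)
    total_uploads_in_bytes: StatusCounter = field(default_factory=StatusCounter)
    total_upload_time_in_millis: int = 0

    def on_upload_started(self, size_bytes):
        self.total_uploads.started += 1
        self.total_uploads_in_bytes.started += size_bytes

    def on_upload_finished(self, size_bytes, took_millis, succeeded):
        if succeeded:
            self.total_uploads.succeeded += 1
            self.total_uploads_in_bytes.succeeded += size_bytes
        else:
            self.total_uploads.failed += 1
            self.total_uploads_in_bytes.failed += size_bytes
        # Assumption: time is counted for failed uploads as well.
        self.total_upload_time_in_millis += took_millis


totals = TranslogUploadTotals()
totals.on_upload_started(1000)
totals.on_upload_finished(1000, took_millis=40, succeeded=True)
totals.on_upload_started(500)
totals.on_upload_finished(500, took_millis=10, succeeded=False)
totals.on_upload_started(250)  # still in flight
```

With this bookkeeping, started can exceed succeeded + failed while uploads are in flight.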

Performance

  1. upload_size_in_bytes
    Represents the size of data to be uploaded to RTS. Eligible sub-fields: moving_avg.

  2. upload_speed_in_bytes_per_sec
    Represents the speed of RTS uploads in bytes per second. Eligible sub-fields: moving_avg.

  3. upload_latency_in_millis
    Represents the time taken by RTS upload. Eligible sub-fields: moving_avg.
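
The moving_avg sub-fields could be computed with a fixed-size window, as in the sketch below. The window size, the class names, and the choice of a simple (unweighted) average over the most recent uploads are all assumptions for illustration; the RFC does not state how the average is computed.

```python
from collections import deque


class MovingAverage:
    """Simple average over the most recent `window_size` values."""

    def __init__(self, window_size):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        # deque with maxlen drops the oldest value automatically.
        self._window = deque(maxlen=window_size)

    def record(self, value):
        self._window.append(value)

    def average(self):
        if not self._window:
            return 0.0
        return sum(self._window) / len(self._window)


class TranslogUploadPerformance:
    """Hypothetical holder for the three performance stats."""

    def __init__(self, window_size=20):  # window size is an assumed default
        self.upload_size_in_bytes = MovingAverage(window_size)
        self.upload_speed_in_bytes_per_sec = MovingAverage(window_size)
        self.upload_latency_in_millis = MovingAverage(window_size)

    def record_upload(self, size_bytes, latency_millis):
        self.upload_size_in_bytes.record(size_bytes)
        self.upload_latency_in_millis.record(latency_millis)
        if latency_millis > 0:
            # bytes per second = bytes / (milliseconds / 1000)
            self.upload_speed_in_bytes_per_sec.record(size_bytes * 1000 / latency_millis)


perf = TranslogUploadPerformance(window_size=2)
perf.record_upload(size_bytes=100, latency_millis=50)
perf.record_upload(size_bytes=200, latency_millis=100)
perf.record_upload(size_bytes=400, latency_millis=100)
```

With a window of 2, only the last two uploads contribute to each average.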


API design

Base Path

GET /_remotestore/stats

Supported path parameters

  1. Name of RTS-enabled index (required)
  2. Shard ID for RTS-enabled index (optional)

Supported query parameters

  1. local - Retrieves stats only for the shards on the coordinating node.
  2. Default (no parameters) - Retrieves stats for all the shards of the index across the participating nodes.
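
A small sketch of how a client might assemble the request path from these parameters follows. The helper name is hypothetical, and passing the flag as `local=true` is an assumption about the parameter's value form.

```python
from urllib.parse import quote, urlencode


def remote_store_stats_path(index, shard_id=None, local=False):
    """Build the request path for GET /_remotestore/stats (hypothetical helper)."""
    if not index:
        raise ValueError("index name is required")
    path = "/_remotestore/stats/" + quote(index, safe="")
    if shard_id is not None:
        path += f"/{shard_id}"
    if local:
        # Assumed form of the flag; restricts stats to shards on the coordinating node.
        path += "?" + urlencode({"local": "true"})
    return path


index_path = remote_store_stats_path("my-index-1")
shard_path = remote_store_stats_path("my-index-1", shard_id=0, local=True)
```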

Shard-level stats for RTS-enabled index

Path:

GET /_remotestore/stats/<index>/<shardId>

Response:

{
    "shard_id" : "[my-index-1][0]",
    "segments": {
       <RSS flow-agnostic stats here>
       "upload" : {
            "refresh_time_lag_in_millis": 5727,
            "refresh_lag": 1,
            "bytes_lag": 0,
            "backpressure_rejection_count": 0,
            "consecutive_failure_count": 0,
            "total_remote_refresh": {
                "started": 57,
                "succeeded": 56,
                "failed": 0
            },
            "total_uploads_in_bytes": {
                "started": 1568138701,
                "succeeded": 1568138701,
                "failed": 0
            },
            "remote_refresh_size_in_bytes": {
                "last_successful": 12705142,
                "moving_avg": 32766119.75
            },
            "upload_latency_in_bytes_per_sec": {
                "moving_avg": 25523682.95
            },
            "remote_refresh_latency_in_millis": {
                "moving_avg": 990.55
            }
        },
       "download" : [
            <new RSS download flow stats here>
        ]
    },
    "translog": {
       <RTS flow-agnostic stats here>
       "upload" : {
            "lag": 2,
            "last_upload_timestamp": 1687941312,
            "total_uploads": {
                "started": 98,
                "succeeded": 96,
                "failed": 0
            },
            "total_uploads_in_bytes": {
                "started": 246465,
                "succeeded": 236647,
                "failed": 0
            },
            "total_upload_time_in_millis": 900,
            "upload_size_in_bytes": {
                "moving_avg": 236.75
            },
            "upload_speed_in_bytes_per_sec": {
                "moving_avg": 211.95
            },
            "upload_latency_in_millis": {
                "moving_avg": 70.55
            }
        },
       "download" : [
            <new RTS download flow stats here>
        ]
    }
}
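
To show how a consumer might read the translog.upload object, the sketch below derives a few values from the sample numbers above. The helper names are hypothetical. Treating started minus succeeded minus failed as in-flight work, and treating last_upload_timestamp as epoch seconds, are assumptions drawn from the sample values rather than statements in the RFC.

```python
def in_flight(counter):
    """Work that has started but has neither succeeded nor failed (assumed meaning)."""
    return counter["started"] - counter["succeeded"] - counter["failed"]


def summarize_translog_upload(shard_stats, now_epoch_seconds):
    upload = shard_stats["translog"]["upload"]
    return {
        "lag": upload["lag"],
        "seconds_since_last_upload": now_epoch_seconds - upload["last_upload_timestamp"],
        "uploads_in_flight": in_flight(upload["total_uploads"]),
        "bytes_in_flight": in_flight(upload["total_uploads_in_bytes"]),
    }


# Subset of the sample shard-level response.
sample = {
    "shard_id": "[my-index-1][0]",
    "translog": {
        "upload": {
            "lag": 2,
            "last_upload_timestamp": 1687941312,
            "total_uploads": {"started": 98, "succeeded": 96, "failed": 0},
            "total_uploads_in_bytes": {"started": 246465, "succeeded": 236647, "failed": 0},
        }
    },
}

summary = summarize_translog_upload(sample, now_epoch_seconds=1687941342)
```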

Index-level stats for RTS-enabled index

Path:

GET /_remotestore/stats/<index>

Response:

{
    {
        "shard_id" : "[my-index-1][0]",
        "segments": {
            <RSS flow-agnostic stats here>
            "upload" : {
                "refresh_time_lag_in_millis": 5727,
                "refresh_lag": 1,
                "bytes_lag": 0,
                "backpressure_rejection_count": 0,
                "consecutive_failure_count": 0,
                "total_remote_refresh": {
                    "started": 57,
                    "succeeded": 56,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 1568138701,
                    "succeeded": 1568138701,
                    "failed": 0
                },
                "remote_refresh_size_in_bytes": {
                    "last_successful": 12705142,
                    "moving_avg": 32766119.75
                },
                "upload_latency_in_bytes_per_sec": {
                    "moving_avg": 25523682.95
                },
                "remote_refresh_latency_in_millis": {
                    "moving_avg": 990.55
                }
            },
            "download" : [
                <new RSS download flow stats here>
            ]
        },
        "translog": {
            <RTS flow-agnostic stats here>
            "upload" : {
                "lag": 2,
                "last_upload_timestamp": 1687941312,
                "total_uploads": {
                    "started": 98,
                    "succeeded": 96,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 246465,
                    "succeeded": 236647,
                    "failed": 0
                },
                "total_upload_time_in_millis": 900,
                "upload_size_in_bytes": {
                    "moving_avg": 236.75
                },
                "upload_speed_in_bytes_per_sec": {
                    "moving_avg": 211.95
                },
                "upload_latency_in_millis": {
                    "moving_avg": 70.55
                }
            },
            "download" : [
                <new RTS download flow stats here>
            ]
        }
    },
    
    ...,
    
    {
        "shard_id" : "[my-index-1][N]",
        "segments": {
            <RSS flow-agnostic stats here>
            "upload" : {
                "refresh_time_lag_in_millis": 5727,
                "refresh_lag": 1,
                "bytes_lag": 0,
                "backpressure_rejection_count": 0,
                "consecutive_failure_count": 0,
                "total_remote_refresh": {
                    "started": 57,
                    "succeeded": 56,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 1568138701,
                    "succeeded": 1568138701,
                    "failed": 0
                },
                "remote_refresh_size_in_bytes": {
                    "last_successful": 12705142,
                    "moving_avg": 32766119.75
                },
                "upload_latency_in_bytes_per_sec": {
                    "moving_avg": 25523682.95
                },
                "remote_refresh_latency_in_millis": {
                    "moving_avg": 990.55
                }
            },
            "download" : [
                <new RSS download flow stats here>
            ]
        },
        "translog": {
            <RTS flow-agnostic stats here>
            "upload" : {
                "lag": 2,
                "last_upload_timestamp": 1687941312,
                "total_uploads": {
                    "started": 98,
                    "succeeded": 96,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 246465,
                    "succeeded": 236647,
                    "failed": 0
                },
                "total_upload_time_in_millis": 900,
                "upload_size_in_bytes": {
                    "moving_avg": 236.75
                },
                "upload_speed_in_bytes_per_sec": {
                    "moving_avg": 211.95
                },
                "upload_latency_in_millis": {
                    "moving_avg": 70.55
                }
            },
            "download" : [
                <new RTS download flow stats here>
            ]
        }
    }
}
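
A consumer of the index-level response might scan the per-shard objects for translog lag, as sketched below. The sketch assumes the response body has already been parsed into a list of per-shard dicts, since the outer structure above is schematic. The function name and the threshold parameter are hypothetical.

```python
def shards_with_translog_lag(shards, max_lag=0):
    """Return (shard_id, lag) for shards whose translog upload lag exceeds max_lag."""
    lagging = []
    for shard in shards:
        translog = shard.get("translog")
        if translog is None:
            # No translog object: RTS is not enabled for this index.
            continue
        lag = translog["upload"]["lag"]
        if lag > max_lag:
            lagging.append((shard["shard_id"], lag))
    return lagging


shards = [
    {"shard_id": "[my-index-1][0]", "translog": {"upload": {"lag": 2}}},
    {"shard_id": "[my-index-1][1]", "translog": {"upload": {"lag": 0}}},
    {"shard_id": "[my-index-1][2]", "segments": {"upload": {}}},
]
lagging = shards_with_translog_lag(shards)
```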

Shard-level stats for RTS-disabled but RSS-enabled index

Path:

GET /_remotestore/stats/<index>/<shardId>

Response:

{
    "shard_id" : "[my-index-1][0]",
    "segments": {
       <RSS flow-agnostic stats here>
       "upload" : {
            "refresh_time_lag_in_millis": 5727,
            "refresh_lag": 1,
            "bytes_lag": 0,
            "backpressure_rejection_count": 0,
            "consecutive_failure_count": 0,
            "total_remote_refresh": {
                "started": 57,
                "succeeded": 56,
                "failed": 0
            },
            "total_uploads_in_bytes": {
                "started": 1568138701,
                "succeeded": 1568138701,
                "failed": 0
            },
            "remote_refresh_size_in_bytes": {
                "last_successful": 12705142,
                "moving_avg": 32766119.75
            },
            "upload_latency_in_bytes_per_sec": {
                "moving_avg": 25523682.95
            },
            "remote_refresh_latency_in_millis": {
                "moving_avg": 990.55
            }
        },
       "download" : [
            <new RSS download flow stats here>
        ]
    }
}

Index-level stats for RTS-disabled but RSS-enabled index

Path:

GET /_remotestore/stats/<index>

Response:

{
    {
        "shard_id" : "[my-index-1][0]",
        "segments": {
            <RSS flow-agnostic stats here>
            "upload" : {
                "refresh_time_lag_in_millis": 5727,
                "refresh_lag": 1,
                "bytes_lag": 0,
                "backpressure_rejection_count": 0,
                "consecutive_failure_count": 0,
                "total_remote_refresh": {
                    "started": 57,
                    "succeeded": 56,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 1568138701,
                    "succeeded": 1568138701,
                    "failed": 0
                },
                "remote_refresh_size_in_bytes": {
                    "last_successful": 12705142,
                    "moving_avg": 32766119.75
                },
                "upload_latency_in_bytes_per_sec": {
                    "moving_avg": 25523682.95
                },
                "remote_refresh_latency_in_millis": {
                    "moving_avg": 990.55
                }
            },
            "download" : [
                <new RSS download flow stats here>
            ]
        }
    },
    
    ...,
    
    {
        "shard_id" : "[my-index-1][N]",
        "segments": {
            <RSS flow-agnostic stats here>
            "upload" : {
                "refresh_time_lag_in_millis": 5727,
                "refresh_lag": 1,
                "bytes_lag": 0,
                "backpressure_rejection_count": 0,
                "consecutive_failure_count": 0,
                "total_remote_refresh": {
                    "started": 57,
                    "succeeded": 56,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 1568138701,
                    "succeeded": 1568138701,
                    "failed": 0
                },
                "remote_refresh_size_in_bytes": {
                    "last_successful": 12705142,
                    "moving_avg": 32766119.75
                },
                "upload_latency_in_bytes_per_sec": {
                    "moving_avg": 25523682.95
                },
                "remote_refresh_latency_in_millis": {
                    "moving_avg": 990.55
                }
            },
            "download" : [
                <new RSS download flow stats here>
            ]
        }
    }
}

Related information

  1. [Draft] Identify stats for remote store feature #6789
  2. [RFC] [Remote Store] Remote Store Stats API #7153
  3. https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/remote-store/remote-store-stats-api/

@BhumikaSaini-Amazon BhumikaSaini-Amazon added enhancement Enhancement or improvement to existing feature or request untriaged labels Jun 28, 2023
@BhumikaSaini-Amazon BhumikaSaini-Amazon changed the title [RFC] [Remote Store] /_remote_store/stats API enhancements for observability on Remote Translog Store upload operations [RFC] [Remote Store] /_remotestore/stats API enhancements for observability on Remote Translog Store upload operations Jun 28, 2023
@anasalkouz anasalkouz added Storage:Durability Issues and PRs related to the durability framework and removed untriaged labels Jun 28, 2023
@Bukhtawar Bukhtawar added the Storage Issues and PRs relating to data and metadata storage label Jul 27, 2023
@BhumikaSaini-Amazon BhumikaSaini-Amazon changed the title [RFC] [Remote Store] /_remotestore/stats API enhancements for observability on Remote Translog Store upload operations [RFC] [Remote Store] /_remotestore/stats API and _nodes/stats API enhancements for observability on Remote Translog Store upload operations Aug 16, 2023
4 participants