TorchServe Metrics

Introduction

TorchServe collects system-level metrics at regular intervals and also provides an API for collecting custom metrics. Metrics collected by TorchServe are logged and can be aggregated by metric agents. The system-level metrics are collected every minute. Metrics defined by the custom service code can be collected per request or per batch of requests. TorchServe logs these two sets of metrics to different log files. By default, metrics are written to:

  • System metrics - log_directory/ts_metrics.log
  • Custom metrics - log_directory/model_metrics.log

The location of log files and metric files can be configured in the log4j2.xml file.

System Metrics

| Metric Name | Dimension | Unit | Semantics |
|---|---|---|---|
| CPUUtilization | host | percentage | CPU utilization on host |
| DiskAvailable | host | GB | disk available on host |
| DiskUsage | host | GB | disk used on host |
| DiskUtilization | host | percentage | disk used on host |
| MemoryAvailable | host | MB | memory available on host |
| MemoryUsed | host | MB | memory used on host |
| MemoryUtilization | host | percentage | memory utilization on host |
| GPUUtilization | host, device_id | percentage | GPU utilization on host, device_id |
| GPUMemoryUtilization | host, device_id | percentage | GPU memory utilization on host, device_id |
| GPUMemoryUsed | host, device_id | MB | GPU memory used on host, device_id |
| Requests2XX | host | count | logged for every response with a status code in the 200-299 range |
| Requests4XX | host | count | logged for every response with a status code in the 400-499 range |
| Requests5XX | host | count | logged for every response with a status code of 500 or above |

Formatting

TorchServe emits metrics to log files by default. The metrics are written in a StatsD-like format:

CPUUtilization.Percent:0.0|#Level:Host|#hostname:my_machine_name
MemoryUsed.Megabytes:13840.328125|#Level:Host|#hostname:my_machine_name   
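
For agents that consume these lines, the layout can be split mechanically. The following is a minimal sketch, assuming each line has exactly the shape of the samples above (`Name.Unit:value` followed by `|#key:value` pairs); `parse_metric_line` is a hypothetical helper, not part of TorchServe, and real log files may carry extra prefixes or fields that it does not handle.

```python
def parse_metric_line(line):
    """Split 'Name.Unit:value|#key:value|#key:value' into its parts.

    Hypothetical helper; assumes the exact shape of the sample lines.
    """
    head, *tags = line.strip().split("|#")
    # The value follows the last ':' of the head; name and unit are joined by '.'
    name_unit, _, raw_value = head.rpartition(":")
    name, _, unit = name_unit.partition(".")
    dimensions = dict(tag.split(":", 1) for tag in tags)
    return {
        "name": name,
        "unit": unit,
        "value": float(raw_value),
        "dimensions": dimensions,
    }


sample = "MemoryUsed.Megabytes:13840.328125|#Level:Host|#hostname:my_machine_name"
parsed = parse_metric_line(sample)
print(parsed["name"], parsed["unit"], parsed["value"], parsed["dimensions"])
```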

To enable metric logging in JSON format, set "patternlayout" to "JSONPatternLayout" in log4j2.xml (see the sample log4j2-json.xml). For more information, see Logging in TorchServe.

After you enable JSON log formatting, logs will look as follows:

{
  "MetricName": "DiskAvailable",
  "Value": "108.15547180175781",
  "Unit": "Gigabytes",
  "Dimensions": [
    {
      "Name": "Level",
      "Value": "Host"
    }
  ],
  "HostName": "my_machine_name"
}
{
  "MetricName": "DiskUsage",
  "Value": "124.13163757324219",
  "Unit": "Gigabytes",
  "Dimensions": [
    {
      "Name": "Level",
      "Value": "Host"
    }
  ],
  "HostName": "my_machine_name"
}
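
Because the JSON records are written one after another rather than as a single array, a reader has to decode them one at a time. The sketch below shows one way to do that with the standard library, assuming the log text contains only records shaped like the samples above; `iter_json_records` is a hypothetical helper name.

```python
import json


def iter_json_records(text):
    """Yield each JSON object from text holding back-to-back objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here
        if text[pos].isspace():
            pos += 1
            continue
        record, pos = decoder.raw_decode(text, pos)
        yield record


sample_log = """
{"MetricName": "DiskAvailable", "Value": "108.15547180175781", "Unit": "Gigabytes",
 "Dimensions": [{"Name": "Level", "Value": "Host"}], "HostName": "my_machine_name"}
{"MetricName": "DiskUsage", "Value": "124.13163757324219", "Unit": "Gigabytes",
 "Dimensions": [{"Name": "Level", "Value": "Host"}], "HostName": "my_machine_name"}
"""

# "Value" is logged as a string, so convert it before doing arithmetic
values = {r["MetricName"]: float(r["Value"]) for r in iter_json_records(sample_log)}
print(values)
```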

To enable metric logging in QLog format, set "patternlayout" to "QLogLayout" in log4j2.xml (see the sample log4j2-qlog.xml). For more information, see Logging in TorchServe.

After you enable QLog formatting, logs will look as follows:

HostName=abc.com
StartTime=1646686978
Program=MXNetModelServer
Metrics=MemoryUsed=5790.98046875 Megabytes Level|Host
EOE
HostName=147dda19895c.ant.amazon.com
StartTime=1646686978
Program=MXNetModelServer
Metrics=MemoryUtilization=46.2 Percent Level|Host
EOE
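
Each QLog record above is a run of `key=value` lines closed by an `EOE` line. As a rough sketch of how such output could be read back, assuming records look exactly like the samples, the hypothetical helpers below (`parse_qlog` and `parse_metric_field`, neither of which is part of TorchServe) group lines into records and split the `Metrics` field.

```python
def parse_qlog(text):
    """Group 'key=value' lines into one dict per EOE-terminated record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "EOE":
            records.append(current)
            current = {}
            continue
        # Split on the first '=' only; the Metrics value contains another '='
        key, _, value = line.partition("=")
        current[key] = value
    return records


def parse_metric_field(field):
    """Split 'Name=value Unit Dimensions' into (name, value, unit, dims)."""
    name, _, rest = field.partition("=")
    amount, unit, *dims = rest.split()
    return name, float(amount), unit, dims


sample_log = """\
HostName=abc.com
StartTime=1646686978
Program=MXNetModelServer
Metrics=MemoryUsed=5790.98046875 Megabytes Level|Host
EOE
HostName=abc.com
StartTime=1646686978
Program=MXNetModelServer
Metrics=MemoryUtilization=46.2 Percent Level|Host
EOE
"""

records = parse_qlog(sample_log)
for record in records:
    print(record["HostName"], parse_metric_field(record["Metrics"]))
```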

Custom Metrics API

TorchServe enables the custom service code to emit metrics that are then logged by the system.

The custom service code is provided with a context of the current request with a metrics object:

# Access context metrics as follows
metrics = context.metrics

All metrics are collected within the context.

Create dimension object(s)

Dimensions for metrics can be defined as objects:

from ts.metrics.dimension import Dimension

# Dimensions are name-value pairs
dim1 = Dimension(name, value)
dim2 = Dimension(some_name, some_value)
# ...
dimN = Dimension(name_n, value_n)

NOTE: The metric functions below accept a list of dimensions.

Add generic metrics

One can add metrics with generic units using the following function.

Function API

    def add_metric(name, value, unit, idx=None, dimensions=None):
        """
        Add a generic metric with a custom unit

        Parameters
        ----------
        name : str
            metric name
        value: int, float
            value of metric
        unit: str
            unit of metric
        idx: int
            request_id index in batch
        dimensions: list
            list of dimensions for the metric
        """

To add custom generic metrics:

# Add Distance as a metric
# dimensions = [dim1, dim2, dim3, ..., dimN]
# Assuming batch size is 1 for example
metrics.add_metric('DistanceInKM', distance, 'km', dimensions=dimensions)

Add time-based metrics

Add time-based metrics by invoking the following method:

Function API

    def add_time(name, value, idx=None, unit='ms', dimensions=None):
        """
        Add a time based metric like latency, default unit is 'ms'

        Parameters
        ----------
        name : str
            metric name
        value: int
            value of metric
        idx: int
            request_id index in batch
        unit: str
            unit of metric; default here is 'ms', 's' is also accepted
        dimensions: list
            list of dimensions for the metric
        """

Note that the default unit in this case is 'ms'.

Supported units: ['ms', 's']

To add custom time-based metrics:

# Add inference time
# dimensions = [dim1, dim2, dim3, ..., dimN]
# Assuming batch size is 1 for example
metrics.add_time('InferenceTime', end_time-start_time, None, 'ms', dimensions)
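
The call above leaves open how `start_time` and `end_time` are obtained. One possible approach, sketched here, measures elapsed time with `time.perf_counter` and converts it to milliseconds before calling `add_time`. `timed_call` is a hypothetical helper, and the `Mock` object is only a stand-in for `context.metrics` so the sketch runs outside TorchServe. The docstring above lists `value` as an int, so rounding the elapsed time may be preferable in practice.

```python
import time
from unittest.mock import Mock


def timed_call(metrics, name, func, *args, dimensions=None):
    """Run func(*args) and report its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    result = func(*args)
    # perf_counter returns seconds, so scale to milliseconds
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Same argument order as the documented call: name, value, idx, unit, dimensions
    metrics.add_time(name, elapsed_ms, None, 'ms', dimensions)
    return result


# Hypothetical stand-in for context.metrics; inside a handler, pass the real object
metrics = Mock()
result = timed_call(metrics, 'InferenceTime', sum, [1, 2, 3])
print(result, metrics.add_time.call_args)
```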

Add size-based metrics

Add size-based metrics by invoking the following method:

Function API

    def add_size(name, value, idx=None, unit='MB', dimensions=None):
        """
        Add a size based metric

        Parameters
        ----------
        name : str
            metric name
        value: int, float
            value of metric
        idx: int
            request_id index in batch
        unit: str
            unit of metric; default here is 'MB', 'kB' and 'GB' are also supported
        dimensions: list
            list of dimensions for the metric
        """

Note that the default unit in this case is 'MB'.

Supported units: ['MB', 'kB', 'GB']

To add custom size-based metrics:

# Add Image size as a metric
# dimensions = [dim1, dim2, dim3, ..., dimN]
# Assuming batch size 1
metrics.add_size('SizeOfImage', img_size, None, 'MB', dimensions)
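
The examples in this document assume a batch size of 1 and pass `None` for `idx`. For larger batches, the docstring describes `idx` as the request_id index in the batch, so one plausible pattern is to emit one metric per item and pass its position. This is a sketch under that assumption: `record_payload_sizes` is a hypothetical helper, the `Mock` is a stand-in for `context.metrics`, and treating 1 kB as 1000 bytes is a choice made for this example only.

```python
from unittest.mock import Mock


def record_payload_sizes(metrics, payloads, dimensions=None):
    """Emit one size metric per batch item, tagged with its batch index."""
    for idx, payload in enumerate(payloads):
        # Assumption for this sketch: 1 kB is taken as 1000 bytes
        size_kb = len(payload) / 1000
        metrics.add_size('SizeOfImage', size_kb, idx, 'kB', dimensions)


# Hypothetical stand-in for context.metrics and for a batch of raw request bodies
metrics = Mock()
batch = [b'x' * 1000, b'x' * 2000]
record_payload_sizes(metrics, batch)
print(metrics.add_size.call_args_list)
```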

Add Percentage based metrics

Percentage-based metrics can be added by invoking the following method:

Function API

    def add_percent(name, value, idx=None, dimensions=None):
        """
        Add a percentage based metric

        Parameters
        ----------
        name : str
            metric name
        value: int, float
            value of metric
        idx: int
            request_id index in batch
        dimensions: list
            list of dimensions for the metric
        """

To add custom percentage-based metrics:

# Add MemoryUtilization as a metric
# dimensions = [dim1, dim2, dim3, ..., dimN]
# Assuming batch size 1
metrics.add_percent('MemoryUtilization', utilization_percent, None, dimensions)

Add counter-based metrics

Counter-based metrics can be added by invoking the following method:

Function API

    def add_counter(name, value, idx=None, dimensions=None):
        """
        Add a counter metric or increment an existing counter metric

        Parameters
        ----------
        name : str
            metric name
        value: int
            value of metric
        idx: int
            request_id index in batch
        dimensions: list
            list of dimensions for the metric
        """

To create, increment and decrement counter-based metrics we can use the following calls:

# Add Loop Count as a metric
# dimensions = [dim1, dim2, dim3, ..., dimN]
# Assuming batch size 1

# Create a counter with name 'LoopCount' and dimensions, initial value
metrics.add_counter('LoopCount', 1, None, dimensions)

# Increment counter by 2
metrics.add_counter('LoopCount', 2, None, dimensions)

# Decrement counter by 1
metrics.add_counter('LoopCount', -1, None, dimensions)

# Final counter value in this case is 2

Log custom metrics

The following sample code can be used to log the custom metrics created in the model's custom handler:

import logging

logger = logging.getLogger(__name__)

for metric in metrics.store:
    logger.info("[METRICS]%s", str(metric))

This custom metrics information is logged in the model_metrics.log file, which is configured through the log4j2.xml file.