## Optimizing Performance
Triton offers several tools to help you tune deployment parameters and optimize for your target metric, whether that is throughput, latency, device utilization, or some other measure of performance. Some of these optimizations depend on the expected server load and on whether clients will submit inference requests in batches or one at a time. Triton's performance analysis tools let you test performance under a wide range of anticipated scenarios and adjust deployment parameters accordingly. For this example, we will use Triton's `perf_analyzer` [tool](https://github.com/triton-inference-server/server/blob/main/docs/perf_analyzer.md#performance-analyzer), which lets us quickly measure throughput and latency for different batch sizes and levels of request concurrency.
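
To explore more than one scenario, it can help to generate the `perf_analyzer` invocations programmatically. The following is a minimal sketch, not part of Triton itself: the helper names `build_perf_analyzer_cmd` and `sweep_commands` are hypothetical, and the only flags used are the ones that appear in this notebook (`-m`, `-b`, and `--concurrency-range`). It only builds command strings; running them is left to the notebook or a shell.

```python
import shlex
from itertools import product


def build_perf_analyzer_cmd(model, batch_size, concurrency):
    """Return the perf_analyzer argument list for one fixed concurrency level.

    Hypothetical helper: it only uses the flags shown in this notebook.
    """
    return [
        "perf_analyzer",
        "-m", model,
        "-b", str(batch_size),
        # "N:N" pins the concurrency to a single value, as in the cells below.
        "--concurrency-range", f"{concurrency}:{concurrency}",
    ]


def sweep_commands(model, batch_sizes, concurrencies):
    """Yield one shell-ready command string per (batch size, concurrency) pair."""
    for batch_size, concurrency in product(batch_sizes, concurrencies):
        yield shlex.join(build_perf_analyzer_cmd(model, batch_size, concurrency))


if __name__ == "__main__":
    # Example grid; the values are illustrative, not recommendations.
    for cmd in sweep_commands("model-cpu", [1, 6], [1, 6]):
        print(cmd)
```

Each printed line can be pasted into a notebook cell after a leading `!`. Building the argument list first and quoting it with `shlex.join` keeps model names with unusual characters from being misinterpreted by the shell.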

In [None]:
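# Quote the extras specifier so the shell does not treat the brackets as a glob pattern
!pip install "tritonclient[all]"
!apt update
# -y answers the confirmation prompt, which would otherwise stall a notebook cell
!apt install -y libb64-0d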

In [3]:
# CPU serving with a batch size of 6 and 6 concurrent requests gives reasonable throughput, but at relatively high latency
!perf_analyzer -m model-cpu -b 6 --concurrency-range 6:6

*** Measurement Settings ***
  Batch size: 6
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 6 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 6
  Client: 
    Request count: 14331
    Throughput: 17197.2 infer/sec
    Avg latency: 2091 usec (standard deviation 639 usec)
    p50 latency: 1954 usec
    p90 latency: 2396 usec
    p95 latency: 2820 usec
    p99 latency: 4613 usec
    Avg HTTP time: 2093 usec (send/recv 41 usec + response wait 2052 usec)
  Server: 
    Inference count: 103002
    Execution count: 5843
    Successful request count: 17167
    Avg request latency: 1872 usec (overhead 864 usec + queue 837 usec + compute input 1 usec + compute infer 165 usec + compute output 5 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 6, throughput: 17197.2 infer/sec, latency 2091 usec
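
When comparing several runs, it is convenient to pull the headline numbers out of the report automatically. The sketch below parses the final summary line of the output above. The regular expression is written against the exact line format printed in this notebook, which is an assumption: other `perf_analyzer` versions may format the line differently. The function name `parse_summary` and the dictionary keys are hypothetical.

```python
import re

# Matches lines such as:
#   Concurrency: 6, throughput: 17197.2 infer/sec, latency 2091 usec
# Assumption: the format is the one printed in this notebook.
SUMMARY_RE = re.compile(
    r"Concurrency:\s*(?P<concurrency>\d+),\s*"
    r"throughput:\s*(?P<throughput>[\d.]+)\s*infer/sec,\s*"
    r"latency\s*(?P<latency>\d+)\s*usec"
)


def parse_summary(text):
    """Return one dict per summary line found in perf_analyzer output text."""
    results = []
    for match in SUMMARY_RE.finditer(text):
        results.append({
            "concurrency": int(match["concurrency"]),
            "throughput_infer_per_sec": float(match["throughput"]),
            "avg_latency_usec": int(match["latency"]),
        })
    return results


if __name__ == "__main__":
    sample = "Concurrency: 6, throughput: 17197.2 infer/sec, latency 2091 usec"
    print(parse_summary(sample))
```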


In [4]:
# By comparison, GPU serving at the same batch size and concurrency gives higher throughput at lower latency
!perf_analyzer -m model -b 6 --concurrency-range 6:6

*** Measurement Settings ***
  Batch size: 6
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 6 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 6
  Client: 
    Request count: 62835
    Throughput: 75402 infer/sec
    Avg latency: 476 usec (standard deviation 189 usec)
    p50 latency: 464 usec
    p90 latency: 520 usec
    p95 latency: 559 usec
    p99 latency: 783 usec
    Avg HTTP time: 474 usec (send/recv 34 usec + response wait 440 usec)
  Server: 
    Inference count: 452454
    Execution count: 17269
    Successful request count: 75409
    Avg request latency: 245 usec (overhead 104 usec + queue 123 usec + compute input 5 usec + compute infer 6 usec + compute output 7 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 6, throughput: 75402 infer/sec, latency 476 usec
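
The two reports can be compared with a little arithmetic. The sketch below hard-codes the client-side numbers from the CPU and GPU runs above and computes the throughput and latency ratios. It also checks a rule of thumb, stated here as an approximation rather than as how `perf_analyzer` computes its figures: with a fixed number of synchronous in-flight requests, throughput should be close to batch size × concurrency ÷ average latency. The function name `expected_throughput` is hypothetical.

```python
# Client-side figures copied from the two perf_analyzer reports above.
cpu = {"throughput": 17197.2, "latency_usec": 2091}
gpu = {"throughput": 75402.0, "latency_usec": 476}

BATCH_SIZE = 6
CONCURRENCY = 6


def expected_throughput(batch_size, concurrency, latency_usec):
    """Approximate inferences/sec for a fixed number of in-flight requests.

    Each of the `concurrency` outstanding requests carries `batch_size`
    inferences and takes `latency_usec` on average to complete.
    """
    return batch_size * concurrency / (latency_usec * 1e-6)


throughput_gain = gpu["throughput"] / cpu["throughput"]
latency_reduction = cpu["latency_usec"] / gpu["latency_usec"]

if __name__ == "__main__":
    print(f"GPU throughput gain:   {throughput_gain:.2f}x")
    print(f"GPU latency reduction: {latency_reduction:.2f}x")
    for name, run in (("CPU", cpu), ("GPU", gpu)):
        estimate = expected_throughput(BATCH_SIZE, CONCURRENCY, run["latency_usec"])
        print(f"{name}: measured {run['throughput']:.0f}, estimated {estimate:.0f} infer/sec")
```

Both ratios come out at roughly 4.4x, and the estimated throughputs land within about 1% of the measured values, which suggests that in these runs the throughput difference is almost entirely explained by the lower per-request latency on the GPU.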


## Conclusion
In these notebooks, we showed how to deploy an XGBoost model in Triton using the new FIL backend. While these models can be deployed on both CPU and GPU in Triton, the GPU deployment delivered far higher throughput at lower latency: in the runs above, roughly 4.4x the throughput (75,402 vs. 17,197 infer/sec) at under a quarter of the average client latency (476 vs. 2,091 usec) for the same batch size and concurrency. As a result, for any given latency budget we can deploy more sophisticated models on the GPU, which can in turn yield more accurate results.