## 📚 Prerequisites

Before you start, set up your Azure services, create your Conda environment, and configure your environment variables as described in [SETTINGS.md](SETTINGS.md).

## 📋 Table of Contents

This notebook walks through a performance assessment of Azure OpenAI endpoints, focusing on how efficiently the model processes requests. It covers the following sections:

1. [**Latency Testing**](#run-latency-benchmarking): How to run latency tests against Azure OpenAI endpoints. Latency is the time the model takes to respond to a single request.

2. [**Throughput Testing**](#throughput-benchmarking): How to run throughput tests against Azure OpenAI endpoints. Throughput is the number of requests the model can handle in a given time frame, which indicates its capacity under load.

3. [**Analyzing Test Results**](#analyzing-test-results): How to interpret the results of the latency and throughput tests, and what the metrics imply for the operational efficiency of your model.

For additional information, refer to the following resources:
- [Azure OpenAI API Documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference)
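
To make the two metrics above concrete, here is a minimal sketch that measures per-request latency and overall throughput. It does not call Azure OpenAI: `fake_request` is a hypothetical stand-in that only sleeps, and `run_load`, `n_requests`, and `concurrency` are illustrative names rather than part of this repository.

```python
import asyncio
import time


async def fake_request(delay_s: float = 0.01) -> None:
    """Hypothetical stand-in for one endpoint call; it only sleeps."""
    await asyncio.sleep(delay_s)


async def timed_request(semaphore: asyncio.Semaphore) -> float:
    """Return the wall-clock latency of one request, in seconds."""
    async with semaphore:
        start = time.perf_counter()
        await fake_request()
        return time.perf_counter() - start


async def run_load(n_requests: int = 20, concurrency: int = 5) -> dict:
    """Send n_requests with bounded concurrency and summarize the run."""
    semaphore = asyncio.Semaphore(concurrency)
    start = time.perf_counter()
    latencies = await asyncio.gather(
        *(timed_request(semaphore) for _ in range(n_requests))
    )
    elapsed = time.perf_counter() - start
    return {
        "latencies": latencies,                   # one value per request
        "elapsed_s": elapsed,                     # duration of the whole run
        "throughput_rps": n_requests / elapsed,   # completed requests per second
    }


if __name__ == "__main__":
    summary = asyncio.run(run_load())
    mean_latency = sum(summary["latencies"]) / len(summary["latencies"])
    print(f"mean latency: {mean_latency:.4f} s")
    print(f"throughput:   {summary['throughput_rps']:.1f} requests/s")
```

Inside a notebook, which already runs an event loop, call `await run_load()` instead of `asyncio.run(run_load())`.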

In [None]:
import os
from datetime import datetime

# Define the target directory (change this to your own path)
TARGET_DIRECTORY = r"C:\Users\pablosal\Desktop\azure-openai-benchmark"

# Check if the directory exists
if os.path.exists(TARGET_DIRECTORY):
    # Change the current working directory
    os.chdir(TARGET_DIRECTORY)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {TARGET_DIRECTORY} does not exist.")

Directory C:\Users\pablosal\Desktop\azure-openai-benchmark does not exist.


In [6]:
# SETUP 
MODEL = "gpt-4-turbo-2024-04-09-ptu"
REGION = "swedencentral"
OAZURE_OPENAI_API_KEY = os.getenv("OPENAI_API_KEY_SWEEDENCENTRAL_PTU")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT_SWEEDENCENTRAL_PTU")
AZURE_OPENAI_API_VERSION = "2024-02-15-preview"

# Directory prefix under which the latency results are saved
filename = f"benchmarks/gpt-4-turbo/{REGION}/{MODEL}/"
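
`os.getenv` returns `None` when a variable is missing, so a misspelled or unset variable only surfaces later as an authentication or connection error. A minimal sketch of a fail-fast check follows; `require_env` and the `EXAMPLE_AOAI_ENDPOINT` variable are hypothetical names used for illustration only.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise EnvironmentError(f"Environment variable {name!r} is not set.")
    return value


# Demo with a hypothetical variable so the example runs anywhere.
os.environ["EXAMPLE_AOAI_ENDPOINT"] = "https://example.invalid/"
print(require_env("EXAMPLE_AOAI_ENDPOINT"))
```

In the setup cell, the same helper could wrap each `os.getenv` call so that a missing key stops the notebook before any benchmark runs.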


## Run Latency Benchmarking

In [7]:
from src.performance.aoaihelpers.latencytest import (
    AzureOpenAIBenchmarkStreaming,
    AzureOpenAIBenchmarkNonStreaming,
)

In [8]:
# Define the deployment names and tokens
deployment_names = [MODEL]
max_tokens_list = [100, 600, 700, 800, 900, 1000]
num_iterations = 1
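
As a rough sanity check on the size of a run, the sketch below counts the calls implied by these settings. It assumes the bulk benchmark issues one call per (deployment, max tokens) pair per iteration, which matches the log lines shown in this notebook but is not confirmed from the library code.

```python
from itertools import product

deployment_names = ["gpt-4-turbo-2024-04-09-ptu"]
max_tokens_list = [100, 600, 700, 800, 900, 1000]
num_iterations = 1

# Assumption: one call per (deployment, max_tokens) pair per iteration.
combinations = list(product(deployment_names, max_tokens_list))
total_calls = len(combinations) * num_iterations

for deployment, max_tokens in combinations:
    print(f"{deployment}: max_tokens={max_tokens}, iterations={num_iterations}")
print(f"Total calls: {total_calls}")
```

With `num_iterations = 1`, each configuration yields a single sample, so percentile and spread statistics for it carry little information. Raising the iteration count gives more meaningful percentiles at the cost of a longer run.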

In [9]:
# Create an instance of the benchmarking class
client_non_streaming = AzureOpenAIBenchmarkNonStreaming(
    api_key=OAZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version=AZURE_OPENAI_API_VERSION,
)

# Run the benchmark tests
await client_non_streaming.run_latency_benchmark_bulk(
    deployment_names,
    max_tokens_list,
    iterations=num_iterations,
    context_tokens=1000,
    multiregion=False,
)

2024-05-13 18:08:27,141 - micro - MainProcess - INFO     CPU usage: 18.9% (utils.py:log_system_info:200)
2024-05-13 18:08:27,151 - micro - MainProcess - INFO     RAM usage: 87.1% (utils.py:log_system_info:202)
2024-05-13 18:08:27,624 - micro - MainProcess - INFO     Initiating call for Model: gpt-4-turbo-2024-04-09-ptu, Max Tokens: 100 (latencytest.py:make_call:514)
INFO:micro:Initiating call for Model: gpt-4-turbo-2024-04-09-ptu, Max Tokens: 100
2024-05-13 18:08:27,656 - micro - MainProcess - INFO     CPU usage: 12.8% (utils.py:log_system_info:200)
INFO:micro:CPU usage: 12.8%
2024-05-13 18:08:27,671 - micro - MainProcess - INFO     RAM usage: 87.6% (utils.py:log_system_info:202)
INFO:micro:RAM usage: 87.6%
2024-05-13 18:08:27,785 - micro - MainProcess - INFO     Initiating call for Model: gpt-4-turbo-2024-04-09-ptu, Max Tokens: 600 (latencytest.py:make_call:514)
INFO:micro:Initiating call for Model: gpt-4-turbo-2024-04-09-ptu, Max Tokens: 600
2024-05-13 18:08:27,788 - micro - MainProc

In [11]:
# Compute and display summary statistics for the completed runs
stats = client_non_streaming.calculate_and_show_statistics()

# Save the statistics under a timestamped file name
now = datetime.now()
timestamp = now.strftime("%Y%m%d_%H%M%S")
latency_file_name = filename + f"results_iterations={num_iterations}_time={timestamp}.json"
client_non_streaming.save_statistics_to_file(stats, location=latency_file_name)

+---------------------------------+------------+----------------+--------------------+----------+----------------------+----------------------+---------+----------------------+-------------------+--------------------------+-----------------------+-----------------------------------+-----------------------------------+----------------------+------------+-------------+-----------------+-------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|         Model_MaxTokens         | Iterations |    Regions     |    Median Time     | IQR Time | 95th Percentile Time | 99th Percentile Time | CV Time | Median Prompt Tokens | IQR Prompt Tokens | Median Completion Tokens | IQR
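
The column names in the table above (median, IQR, 95th and 99th percentile, CV) correspond to standard summary statistics. The sketch below shows one common way to compute them with the standard library. It is not the implementation behind `calculate_and_show_statistics`, whose exact definitions may differ, and `summarize_latencies` and the sample values are illustrative.

```python
import statistics


def summarize_latencies(latencies: list) -> dict:
    """Summarize response times (seconds) with common robust statistics."""
    if len(latencies) < 2:
        raise ValueError("At least two samples are needed to measure spread.")
    quartiles = statistics.quantiles(latencies, n=4, method="inclusive")
    percentiles = statistics.quantiles(latencies, n=100, method="inclusive")
    mean = statistics.mean(latencies)
    return {
        "median": statistics.median(latencies),
        "iqr": quartiles[2] - quartiles[0],        # Q3 - Q1
        "p95": percentiles[94],                    # 95th percentile
        "p99": percentiles[98],                    # 99th percentile
        "cv": statistics.stdev(latencies) / mean,  # coefficient of variation
    }


# Illustrative sample values, not measured results.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
for name, value in summarize_latencies(sample).items():
    print(f"{name}: {value:.3f}")
```

The median and IQR are less sensitive to outliers than the mean and standard deviation, which is why latency reports often pair them with high percentiles to describe the tail.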

## Throughput Benchmarking

In [None]:
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Read the throughput test settings from the environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
DEPLOYMENT_ID = os.getenv("AZURE_AOAI_DEPLOYMENT_NAME")
DEPLOYMENT_VERSION = os.getenv("AZURE_AOAI_API_VERSION")

In [None]:
import logging

# Create a custom logger
logger = logging.getLogger(__name__)

# Set the level of this logger. This level acts as a threshold.
# Any message logged at this level, or higher, will be passed to this logger's handlers.
logger.setLevel(logging.DEBUG)

# Create handlers
c_handler = logging.StreamHandler()
c_handler.setLevel(logging.DEBUG)
c_format = logging.Formatter("%(name)s - %(levelname)s - %(message)s")
c_handler.setFormatter(c_format)
logger.addHandler(c_handler)
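
Re-running the logging cell above attaches a new `StreamHandler` each time, so every message is printed once per run of the cell. If the root logger is also configured, propagation can print each record again in a second format, which may explain the doubled lines in the latency output earlier in this notebook. A minimal sketch of an idempotent setup follows; `get_notebook_logger` is a hypothetical helper name.

```python
import logging


def get_notebook_logger(name: str, level: int = logging.DEBUG) -> logging.Logger:
    """Return a logger with exactly one console handler, safe to call repeatedly."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Stop records from also reaching the root logger's handlers.
    logger.propagate = False
    # Attach a handler only on the first call.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setLevel(level)
        handler.setFormatter(
            logging.Formatter("%(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = get_notebook_logger("benchmark_demo")
logger = get_notebook_logger("benchmark_demo")  # second call adds nothing
logger.info("This message is printed once.")
```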

In [None]:
from src.performance.client import LoadTestBenchmarking

# Create a client
benchmarking_client = LoadTestBenchmarking(
    model="gpt-4-1106-pagyo",
    region="swedencentral",
    endpoint=AZURE_OPENAI_ENDPOINT,
)

In [None]:
benchmarking_client.run_tests(
    deployment="gpt-4-1106-pagyo",
    rate=5,
    duration=180,
    shape_profile="custom",
    clients=10,
    context_tokens=1000,
    max_tokens=500,
    prevent_server_caching=True,
    log_save_dir="logs/",
)
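
Before starting a load test, it can help to estimate how much traffic the settings imply. The sketch below assumes that `rate` is expressed in requests per minute and `duration` in seconds, as in the azure-openai-benchmark command-line tool. Whether `LoadTestBenchmarking.run_tests` uses the same units is an assumption, and `estimate_load` is a hypothetical helper.

```python
def estimate_load(rate_rpm: float, duration_s: float,
                  context_tokens: int, max_tokens: int) -> dict:
    """Estimate request and token volume for a fixed-rate load test.

    Assumes rate_rpm is requests per minute and duration_s is seconds.
    Token figures are upper bounds: a response may stop before max_tokens.
    """
    tokens_per_request = context_tokens + max_tokens
    total_requests = rate_rpm * duration_s / 60
    return {
        "total_requests": total_requests,
        "max_tokens_per_minute": rate_rpm * tokens_per_request,
        "max_total_tokens": total_requests * tokens_per_request,
    }


# Values taken from the run_tests call above.
estimate = estimate_load(rate_rpm=5, duration_s=180,
                         context_tokens=1000, max_tokens=500)
for name, value in estimate.items():
    print(f"{name}: {value}")
```

If `rate` turns out to be requests per second, every figure is 60 times larger, so confirm the unit before comparing the estimate with your deployment's quota.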