## 📚 Prerequisites

Ensure that your Azure Services are properly configured, your Conda environment is set up, and your environment variables are aligned with the guidelines provided in the [SETTINGS.md](SETTINGS.md) file.

## 📋 Table of Contents


1. [**Deeper Latency Analysis with Custom Code**](#deeper-latency-analysis-with-custom-code): For a more comprehensive analysis of latency using custom Python code, this section guides you through utilizing the Azure OpenAI SDK to measure Time-To-First-Token (TTFT), Time-To-Last-Token (TTLT), and Time-Between-Tokens (TBT). This involves:

    - **Streaming Calls**: Detailed instructions and code snippets for analyzing latency in streaming calls.
    
    - **Non-Streaming Calls**: Guidance and examples for evaluating latency in non-streaming scenarios.


For further details, please consult the following resources:
- [Azure OpenAI API Documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference)

In [1]:
import os
import logging
from json.decoder import JSONDecodeError
from datetime import datetime
from typing import Tuple, Optional, Dict, Any

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [2]:
# Define the target directory (change yours)
TARGET_DIRECTORY = r"C:\Users\pablosal\Desktop\gbbai-azure-openai-benchmark"

# Check if the directory exists
if os.path.exists(TARGET_DIRECTORY):
    # Change the current working directory
    os.chdir(TARGET_DIRECTORY)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {TARGET_DIRECTORY} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-azure-openai-benchmark


## Configure API Calls for Testing

In [3]:
YOUR_RESOURCE_NAME = os.getenv("AZURE_OPENAI_API_ENDPOINT")
YOUR_DEPLOYMENT_NAME = os.getenv("AZURE_AOAI_CHAT_MODEL_NAME_DEPLOYMENT_ID")
YOUR_DEPLOYMENT_VERSION = os.getenv("AZURE_OPENAI_API_VERSION")
YOUR_AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_KEY")

url = f"{YOUR_RESOURCE_NAME}openai/deployments/{YOUR_DEPLOYMENT_NAME}/chat/completions?api-version={YOUR_DEPLOYMENT_VERSION}"
payload = {
    "messages": [
        {"role": "system", "content": "You are an AI assistant that helps people find information."},
        {"role": "user", "content": "Write a short essay on the importance of AI."}
    ],
    "stream": True,
    "max_tokens": 500
}
headers = {
    "api-key": f"{YOUR_AZURE_OPENAI_API_KEY}",
    "Content-Type": "application/json"
}

## Streaming Calls

In [22]:
import requests
import time
import json
import logging
from functools import wraps
from time import perf_counter
from typing import Callable, Dict, Any, Tuple, Optional, List

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def latency_metrics_collector_streaming(func: Callable) -> Callable:
    """
    Decorator to collect latency metrics (TTFT, TTLT, TBTs) for a streaming API call.

    Args:
        func (Callable): The function to wrap with latency metric collection.

    Returns:
        Callable: The wrapped function with added latency metric collection.
    """
    @wraps(func)
    def wrapper(*args, **kwargs) -> Tuple[Optional[float], Optional[float], Optional[List[float]], str]:
        start_time = perf_counter()
        ttft = None
        tbts = []
        previous_time = start_time

        # Call the actual function
        result = func(*args, **kwargs)

        # Calculate TTLT
        ttl = perf_counter() - start_time
        logger.info(f"TTLT captured: {ttl:.4f} seconds")

        # Extract the final text response
        final_text_response = result.get("final_text_response", "")

        for response_time in result.get("response_times", []):
            if ttft is None:
                ttft = response_time - start_time
                logger.info(f"TTFT captured: {ttft:.4f} seconds")
            last_tbt = (response_time - previous_time)
            tbts.append(last_tbt)
            previous_time = response_time

        if tbts:
            logger.info(f"TBTs captured: {tbts}")

        return ttft, ttl, tbts, final_text_response

    return wrapper

@latency_metrics_collector_streaming
def make_api_call(url: str, payload: Dict[str, Any], headers: Dict[str, str]) -> Dict[str, Any]:
    """
    Make a streaming API call and return the final text response along with response times for each chunk.

    Args:
        url (str): The URL endpoint for the API call.
        payload (Dict[str, Any]): The JSON payload for the API request.
        headers (Dict[str, str]): The headers for the API request.

    Returns:
        Dict[str, Any]: A dictionary containing the final text response and the response times for each chunk.
    """
    response_times = []
    final_text_response = ""

    try:
        with requests.post(url, headers=headers, json=payload, stream=True) as response:
            response.raise_for_status()
            for chunk in response.iter_lines():
                if chunk:
                    response_times.append(perf_counter())
                    decoded_chunk = chunk.decode('utf-8')
                    if decoded_chunk.startswith("data: "):
                        json_string = decoded_chunk[6:]
                        if json_string:
                            try:
                                data = json.loads(json_string)
                                if 'choices' in data and data['choices']:
                                    event_text = data['choices'][0].get('delta', {}).get('content', '')
                                    if event_text:
                                        final_text_response += event_text
                                        print(event_text, end="", flush=True)
                                        time.sleep(0.01)  # Maintain minimal sleep to reduce latency
                            except json.JSONDecodeError as e:
                                logger.error(f"Error decoding JSON: {e}")

    except requests.exceptions.RequestException as e:
        logger.error(f"An error occurred: {e}")

    return {"final_text_response": final_text_response.strip(), "response_times": response_times}

In [23]:
# Call the API and obtain metrics
ttft, ttl, tbts, final_text_response = make_api_call(url, payload, headers)

# Print the collected metrics and the final text response
print(f"Time-To-First-Token (TTFT): {ttft} seconds")
print(f"Time-To-Last-Token (TTLT): {ttl} seconds")
print(f"Time-Between-Tokens (TBTs): {tbts} seconds each")

Artificial Intelligence (AI) has become a transformative force in modern society, redefining industries, enhancing the quality of life, and opening new frontiers in science and technology. The importance of AI is multi-faceted, influencing a broad spectrum of areas from healthcare and education to business and entertainment.

One of the primary advantages of AI is its ability to process vast amounts of data with unprecedented speed and accuracy. This capability is revolutionizing healthcare by enabling early diagnosis and personalized treatment. For example, machine learning algorithms can analyze medical images to detect conditions such as cancer at a much earlier stage than traditional methods. Additionally, AI-driven predictive analytics can identify potential outbreaks of diseases, allowing for timely intervention and better resource allocation.

In the realm of education, AI is personalizing learning experiences for students. Adaptive learning platforms assess the strengths and we

ERROR:__main__:Error decoding JSON: Expecting value: line 1 column 2 (char 1)
INFO:__main__:TTLT captured: 10.5599 seconds
INFO:__main__:TTFT captured: 0.5350 seconds
INFO:__main__:TBTs captured: [0.5350157999998828, 0.0001340999999683845, 0.48128940000015064, 0.02016959999991741, 0.015498400000069523, 0.016496999999844775, 0.015036700000109704, 0.015097299999979441, 0.01570480000009411, 0.01587229999995543, 0.015497599999889644, 0.01551629999994475, 0.015533900000036738, 0.015543600000000879, 0.015268399999968096, 0.014801300000044648, 0.01459330000011505, 0.01598639999997431, 0.0155606999999236, 0.015366699999958655, 0.014709700000139492, 0.015072400000008201, 0.014870699999846693, 0.015700600000172926, 0.015601999999944383, 0.015396300000020346, 0.015470399999912843, 0.015261499999951411, 0.015335000000050059, 1.273691000000099, 0.016419000000041706, 0.01492499999994834, 0.015710599999920305, 0.015964500000109183, 0.015359199999920747, 0.01570769999989352, 0.01500730000020667, 0.016

Time-To-First-Token (TTFT): 0.5350157999998828 seconds
Time-To-Last-Token (TTLT): 10.559917599999835 seconds
Time-Between-Tokens (TBTs): [0.5350157999998828, 0.0001340999999683845, 0.48128940000015064, 0.02016959999991741, 0.015498400000069523, 0.016496999999844775, 0.015036700000109704, 0.015097299999979441, 0.01570480000009411, 0.01587229999995543, 0.015497599999889644, 0.01551629999994475, 0.015533900000036738, 0.015543600000000879, 0.015268399999968096, 0.014801300000044648, 0.01459330000011505, 0.01598639999997431, 0.0155606999999236, 0.015366699999958655, 0.014709700000139492, 0.015072400000008201, 0.014870699999846693, 0.015700600000172926, 0.015601999999944383, 0.015396300000020346, 0.015470399999912843, 0.015261499999951411, 0.015335000000050059, 1.273691000000099, 0.016419000000041706, 0.01492499999994834, 0.015710599999920305, 0.015964500000109183, 0.015359199999920747, 0.01570769999989352, 0.01500730000020667, 0.016007099999796992, 0.016811800000141375, 0.015338799999881303

## Non-Streaming 

In [26]:
import tiktoken

def count_tokens_tiktoken(text: str, encoding_name: str = "cl100k_base") -> int:
    """
    Count the number of tokens in a text string using tiktoken.

    Args:
        text (str): The text to count tokens in.
        encoding_name (str): The name of the encoding to use.

    Returns:
        int: The number of tokens.
    """
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

def latency_metrics_collector_non_streaming(func: Callable) -> Callable:
    """
    Decorator to collect latency metrics (TTFT, TTLT, TBTs) for a non-streaming API call.

    Args:
        func (Callable): The function to wrap with latency metric collection.

    Returns:
        Callable: The wrapped function with added latency metric collection.
    """
    @wraps(func)
    def wrapper(*args, **kwargs) -> Tuple[Optional[float], Optional[float], Optional[List[float]], str]:
        start_time = perf_counter()

        # Call the actual function
        result = func(*args, **kwargs)

        # Calculate TTLT
        ttl = perf_counter() - start_time
        logger.info(f"TTLT captured: {ttl:.4f} seconds")

        # Extract the final text response
        final_text_response = result.get("final_text_response", "")

        # Count the number of tokens
        token_count = count_tokens_tiktoken(final_text_response)
        logger.info(f"Total tokens in response: {token_count}")

        # Calculate TBT
        if token_count > 1:
            avg_tbt = (ttl * 1000) / (token_count - 1)  # Convert TTL to milliseconds and calculate TBT
            tbts = [avg_tbt] * (token_count - 1)  # Create a list with the same TBT value for all tokens
        else:
            tbts = []

        if tbts:
            logger.info(f"TBTs captured: {tbts}")

        return None, ttl, tbts, final_text_response

    return wrapper

@latency_metrics_collector_non_streaming
def make_api_call_non_streaming(url: str, payload: Dict[str, Any], headers: Dict[str, str]) -> Dict[str, Any]:
    """
    Make a non-streaming API call and return the final text response.

    Args:
        url (str): The URL endpoint for the API call.
        payload (Dict[str, Any]): The JSON payload for the API request.
        headers (Dict[str, str]): The headers for the API request.

    Returns:
        Dict[str, Any]: A dictionary containing the final text response.
    """
    final_text_response = ""

    try:
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        final_text_response = response.text
    except requests.exceptions.RequestException as e:
        logger.error(f"An error occurred: {e}")

    return {"final_text_response": final_text_response.strip()}


In [27]:
# Call the API and obtain metrics
ttft, ttl, tbts, final_text_response = make_api_call_non_streaming(url, payload, headers)

# Print the collected metrics and the final text response
print(f"Time-To-First-Token (TTFT): {ttft} seconds")
print(f"Time-To-Last-Token (TTLT): {ttl} seconds")
print(f"Time-Between-Tokens (TBTs): {tbts} seconds each")

INFO:__main__:TTLT captured: 8.2516 seconds
INFO:__main__:Total tokens in response: 70301
INFO:__main__:TBTs captured: [0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742

Time-To-First-Token (TTFT): None seconds
Time-To-Last-Token (TTLT): 8.25161890000004 seconds
Time-Between-Tokens (TBTs): [0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.11737722475106742, 0.117377224751067