# Performance testing for OpenAI Models

This notebook runs performance tests against an Azure OpenAI model. It sends requests to the Azure OpenAI endpoint, gathers metrics, and writes a trace log to Log Analytics for each request it executes. You can then use Kusto queries to analyse the results of the test runs.

Each run has a unique ``runId`` that can be used to retrieve the logs for that specific run.

Each run consists of executing all the prompts of a use case, at every configured concurrency level, repeated a configured number of times.

To execute this notebook you must define the following 2 variables in the notebook itself:
 - Use case (`useCase`) -> the use case that you need to test
 - Deployment names (`deploymentNames`) -> the Azure OpenAI model deployments that will be tested


You will also need to fill in the .env file that is supplied in the package with the correct values for your environment:
```
APPLICATIONINSIGHTS_CONNECTION_STRING="<connection string for your application insights instance>"
AZURE_OPENAI_ENDPOINT="<url of the azure open ai instance>"
AZURE_OPENAI_VERSION="<open ai version>"
AZURE_OPENAI_KEY="<api key for open ai>"
OPENAI_TEMPERATURE=<temperature parameter of the openai call>
OPENAI_TOP_P=<top-p parameter of the openai call>
OPENAI_FREQUENCY_PENALTY=<frequency parameter of the openai call>
OPENAI_PRESENCE_PENALTY=<presence parameter of the openai call>
OPENAI_MAX_TOKENS=<max tokens parameter of the openai call>
OTEL_SERVICE_NAME="<Label to be used on app insights to identify this workload>"
```
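
A missing or misspelled variable tends to surface only later, for example as `None` passed to `float()`. The following is a minimal sketch of an up-front check, assuming the variable names from the template above; `REQUIRED_VARS` and `missing_env_vars` are illustrative names that are not part of the notebook.

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARS = [
    "APPLICATIONINSIGHTS_CONNECTION_STRING",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_VERSION",
    "AZURE_OPENAI_KEY",
    "OPENAI_TEMPERATURE",
    "OPENAI_TOP_P",
    "OPENAI_FREQUENCY_PENALTY",
    "OPENAI_PRESENCE_PENALTY",
    "OPENAI_MAX_TOKENS",
    "OTEL_SERVICE_NAME",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` right after `load_dotenv()` and stopping when the returned list is not empty would catch configuration gaps before any request is sent.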

To set up a use case you need to create a folder for that use case inside the useCases folder.
Inside each use case folder you need to create a .json file for each prompt that will be executed.
The prompts are JSON files that are already in the structure needed to pass to OpenAI, similar to this example:

```json
[
    {
        "role": "system",
        "content": "You are an assistant. You should greet the user in at least 4 languages"
    }
]
```
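
A malformed prompt file would otherwise only fail once the request is sent. The sketch below checks the structure shown in the example; `validate_prompt` and `ALLOWED_ROLES` are hypothetical names, and the set of accepted roles is an assumption that may need adjusting for your deployment.

```python
import json

# Roles assumed for this sketch; adjust to what your deployment accepts.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_prompt(text):
    """Parse the text of a prompt file and return its messages, or raise ValueError."""
    messages = json.loads(text)  # invalid JSON raises a ValueError subclass
    if not isinstance(messages, list) or not messages:
        raise ValueError("prompt must be a non-empty JSON array")
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            raise ValueError(f"message {i} is not an object")
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            raise ValueError(f"message {i} has an unexpected role: {role!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError(f"message {i} needs a string 'content'")
    return messages
```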

Once everything is configured you can run the notebook.

The notebook will execute the following flow:

```
for each concurrency level:
    for each prompt of the use case:
        for each open ai model:
            repeat for the configured number of runs:
                execute the requests to the open ai model concurrently
```

### 1- Setup test

In [13]:
useCase = "UC1" # set here the use case you want to test
deploymentNames = ["functionsgpt35"] # the name of the AI models to test
concurrencyLevels = [1, 2, 4, 8, 16, 32, 64] # the concurrency levels to test
num_runs = 10 # the number of runs to perform for each prompt and concurrency level
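
Before starting, it can help to estimate how much load these settings generate. The sketch below counts the requests and the fixed pause time for one notebook run, assuming the loop structure and sleep durations used in the test execution cell; `estimate_run` is a hypothetical helper, and the number of prompts depends on the files in your use case folder.

```python
def estimate_run(concurrency_levels, num_prompts, num_models, num_runs):
    """Return (total_requests, total_pause_seconds) for one notebook run."""
    batches_per_level = num_prompts * num_models * num_runs
    # Each batch sends `concurrency` requests at once.
    total_requests = sum(concurrency_levels) * batches_per_level
    # Mirrors the pauses in the test loop: 60 s after batches above 32 threads, else 10 s.
    total_pause = sum(
        (60 if c > 32 else 10) * batches_per_level for c in concurrency_levels
    )
    return total_requests, total_pause


total, pause = estimate_run(
    [1, 2, 4, 8, 16, 32, 64], num_prompts=1, num_models=1, num_runs=10
)
print(total, pause)  # 1270 1200
```

With the values above and a single prompt file, the run sends 1270 requests and spends 1200 seconds in pauses alone, on top of the time the requests themselves take.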

### 2- Install dependencies and set up platform

In [None]:
%pip install openai==1.2.4
%pip install azure-monitor-opentelemetry==1.0.0b16
%pip install python-dotenv

In [None]:
import json
import logging
import os
import threading
import time
import uuid

from azure.monitor.opentelemetry import configure_azure_monitor
from dotenv import load_dotenv
from openai import AzureOpenAI


load_dotenv()

# The variable name must match the one defined in the .env file.
configure_azure_monitor(
    connection_string=os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING")
)

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

In [30]:
client = AzureOpenAI(
    api_version = os.getenv("AZURE_OPENAI_VERSION"),
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key = os.getenv("AZURE_OPENAI_KEY")
)


### 3- Define functions to call open ai


In [None]:
def runOpenAIChatCompletion(prompt, model, concurrency, runId, useCase, startTime, file):
    """Send one chat completion request and log its outcome as a trace."""
    failed = False
    tokens = 0
    prompt_tokens = 0
    completion_tokens = 0
    statusCode = 200
    tic = time.perf_counter()

    try:
        completion = client.chat.completions.create(
            model=model,
            messages=prompt,
            temperature=float(os.getenv("OPENAI_TEMPERATURE")),
            max_tokens=int(os.getenv("OPENAI_MAX_TOKENS")),
            top_p=float(os.getenv("OPENAI_TOP_P")),
            frequency_penalty=float(os.getenv("OPENAI_FREQUENCY_PENALTY")),
            presence_penalty=float(os.getenv("OPENAI_PRESENCE_PENALTY")),
            stop=None
        )
        tokens = completion.usage.total_tokens
        prompt_tokens = completion.usage.prompt_tokens
        completion_tokens = completion.usage.completion_tokens
    except Exception as e:
        print(e)
        # Not every exception carries an HTTP status (e.g. connection errors),
        # so fall back to -1 instead of raising AttributeError here.
        statusCode = getattr(e, "status_code", -1)
        failed = True
    toc = time.perf_counter()
    # The message text and property names are used by the Kusto queries below.
    logger.info("Executed open ai", extra={
        "model": model,
        "timeElapsed": toc-tic,
        "concurrency": concurrency,
        "failed": failed,
        "total_tokens": tokens,
        "completion_tokens": completion_tokens,
        "prompt_tokens" : prompt_tokens,
        "runId": str(runId),
        "useCase": useCase,
        "startTime": startTime,
        "statusCode": statusCode,
        "prompt" : file
    })

### 4- Tests execution

In [None]:
# Load every prompt file of the use case.
prompts = []
for file in os.listdir(f"useCases/{useCase}"):
    if file.endswith(".json"):
        with open(f"useCases/{useCase}/{file}", "r") as f:
            prompts.append({"file": file, "content": json.load(f)})

print("Prompts read:", len(prompts))

# One ID per notebook run, logged with every request so results can be filtered later.
runId = uuid.uuid4()

startTime = time.time()
for concurrency in concurrencyLevels:
    for prompt in prompts:
        for model in deploymentNames:
            for i in range(num_runs):
                # Create one thread per concurrent request in this batch.
                threads = [
                    threading.Thread(
                        target=runOpenAIChatCompletion,
                        args=[prompt["content"], model, concurrency, runId, useCase, startTime, prompt["file"]],
                    )
                    for _ in range(concurrency)
                ]
                print(f"RunId: {runId} -- Run {i} -- Model {model} -> Starting {concurrency} threads")
                for thread in threads:
                    thread.start()
                # Wait for the whole batch to finish before moving on.
                for thread in threads:
                    thread.join()
                # Pause between batches; the pause is longer above 32 threads.
                if concurrency > 32:
                    time.sleep(60)
                else:
                    time.sleep(10)

print("Done for runid {}".format(runId))
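
The batch pattern used in this cell (start `concurrency` workers, then wait for all of them) can also be expressed with `concurrent.futures`. The sketch below is an alternative formulation, not the notebook's method; `fake_request` is a hypothetical stand-in that sleeps instead of calling the API, so the example runs without credentials.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(delay):
    """Hypothetical stand-in for runOpenAIChatCompletion; sleeps instead of calling the API."""
    time.sleep(delay)
    return delay


def run_batch(worker, concurrency, *args):
    """Submit `concurrency` calls at once and block until all of them finish."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker, *args) for _ in range(concurrency)]
        return [f.result() for f in futures]


tic = time.perf_counter()
results = run_batch(fake_request, 8, 0.1)
elapsed = time.perf_counter() - tic
# The 8 calls overlap, so the elapsed time stays close to a single call's delay.
print(len(results), f"{elapsed:.2f}s")
```

A pool also collects return values and re-raises worker exceptions through `result()`, which plain `threading.Thread` objects do not do.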

### 5- Get results from log analytics

You can retrieve the test results by querying the Log Analytics workspace where the traces were written.
Use the runId that was generated earlier to identify the test results.

Run the following queries in the Log Analytics workspace:

**Performance monitoring**
```kusto
AppTraces
| where Message == "Executed open ai"
| where tostring(Properties["runId"]) == "<run id>"
| project model=tostring(Properties["model"]), timeElapsed=todecimal(Properties["timeElapsed"]), concurrency=toint(Properties["concurrency"])
| summarize ["Average time in seconds"]=avg(timeElapsed), percentile(timeElapsed,25), percentile(timeElapsed,50), percentile(timeElapsed,75), percentile(timeElapsed,90) by model, concurrency
| order by concurrency asc, model asc
```
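
Replacing the `<run id>` placeholder by hand is easy to get wrong. The sketch below fills it in from Python and rejects values that are not well-formed UUIDs; `PERFORMANCE_QUERY` and `build_query` are hypothetical names, and the template simply repeats the query above with a `{run_id}` placeholder.

```python
import uuid

PERFORMANCE_QUERY = """AppTraces
| where Message == "Executed open ai"
| where tostring(Properties["runId"]) == "{run_id}"
| project model=tostring(Properties["model"]), timeElapsed=todecimal(Properties["timeElapsed"]), concurrency=toint(Properties["concurrency"])
| summarize ["Average time in seconds"]=avg(timeElapsed), percentile(timeElapsed,25), percentile(timeElapsed,50), percentile(timeElapsed,75), percentile(timeElapsed,90) by model, concurrency
| order by concurrency asc, model asc"""


def build_query(template, run_id):
    """Insert a validated run ID into a query template (hypothetical helper)."""
    # uuid.UUID raises ValueError for anything that is not a well-formed UUID,
    # which also keeps stray quotes out of the query text.
    return template.format(run_id=str(uuid.UUID(str(run_id))))
```

Since the notebook already holds `runId` as a `uuid.UUID`, `build_query(PERFORMANCE_QUERY, runId)` would produce text that can be pasted into the query editor.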

**Failures count**
```kusto
AppTraces
| where Message == "Executed open ai"
| where tostring(Properties["runId"]) == "<run id>"
| summarize count() by tostring(Properties["runId"])
| project Total=count_, RunId=Properties_runId
| join (
AppTraces
| where Message == "Executed open ai"
| where tostring(Properties["runId"]) == "<run id>"
| project model=tostring(Properties["model"]), failed=tobool(Properties["failed"]), concurrency=toint(Properties["concurrency"]), RunId=tostring(Properties["runId"]), errorCode = tostring(Properties["statusCode"])
| where failed==true
| summarize ["Number of results"]=count() by failed, model, concurrency, RunId, errorCode
| order by concurrency asc, model asc) on $left.RunId ==  $right.RunId
| project model, concurrency,errorCode, percentageFailed = round((['Number of results']/ todecimal(Total))*100,2)
```
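
Note that this query divides each failure count by the total number of requests in the whole run, not by the requests at that concurrency level. The following sketch reproduces the same arithmetic locally to make that explicit; `failure_percentages` is a hypothetical helper, and the sample records are made up and only shaped like the logged properties.

```python
from collections import Counter


def failure_percentages(records):
    """Map (model, concurrency, statusCode) to its share of all requests in the run, in percent."""
    total = len(records)
    failed = Counter(
        (r["model"], r["concurrency"], r["statusCode"])
        for r in records
        if r["failed"]
    )
    return {key: round(count / total * 100, 2) for key, count in failed.items()}


# Made-up sample records for illustration only.
sample = [
    {"model": "m", "concurrency": 2, "failed": False, "statusCode": 200},
    {"model": "m", "concurrency": 2, "failed": True, "statusCode": 429},
    {"model": "m", "concurrency": 4, "failed": True, "statusCode": 429},
    {"model": "m", "concurrency": 4, "failed": False, "statusCode": 200},
]
print(failure_percentages(sample))
# {('m', 2, 429): 25.0, ('m', 4, 429): 25.0}
```

Each group fails one of its two requests, yet the reported value is 25 percent because the denominator is the four requests of the run.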