## Azure OpenAI Proxy

In complex, distributed scenarios, we may need to multiplex between multiple OpenAI clients and multiple Azure OpenAI deployments. It is also useful to track and attribute the cost of requests per user and per endpoint.

This notebook does exactly that: it creates a simple HTTP proxy server that sits between the client and the Azure OpenAI endpoint, dispatches each request to one of several endpoints, and tracks the usage and cost after each request.

...

We'll start by installing FastAPI and Uvicorn (for the HTTP server), and OpenAI and Requests for making requests.

In [46]:
%pip install fastapi uvicorn openai requests
from IPython.display import clear_output ; clear_output()

Our config file defines 4 users, each with their own API key (note that this is not a valid key for an actual Azure OpenAI endpoint, but one that we maintain ourselves for each user), a list of Azure OpenAI endpoints, and the costs per token, so that we can calculate the cost for each user and endpoint.

In [58]:
%pycat aoai_proxy_config.py

# These are our application's users. Each of them has their own API key.
USERS = {
    'Angela': {
        'api_key': 'angela-12345',
    },
    'Benjamin': {
        'api_key': 'benjamin-23456',
    },
    'Cynthia': {
        'api_key': 'cynthia-34567',
    },
    'David': {
        'api_key': 'david-45678',

Our proxy server imports the config and adds slots for tracking usage and cost per user and endpoint. It defines two REST endpoints. The first is identical to the OpenAI API chat completion endpoint, so we can call it exactly as we would call an Azure OpenAI endpoint directly. It authenticates the user based on their API key, makes a request to one of our Azure OpenAI endpoints (returning the response to the caller as is), and records the tokens used, as reported by the API. The second endpoint reports the usage and cost per user, per endpoint, and in total.

...

Let's take a look ...

In [48]:
%pycat aoai_proxy_server.py

from fastapi import FastAPI, Request
import openai
import random
from aoai_proxy_config import USERS, AOAI_ENDPOINTS, COMPLETION_TOKEN_COST, PROMPT_TOKEN_COST

# For each user and endpoint we'll keep track of how many tokens they've used.
for _, user in USERS.items():
    user['usage'] = {
        'total_completion_tokens': 0,
        'total_prompt_tokens':

We'll run the server in the background, so that we can make requests to it (on localhost, port 8000).

In [49]:
import os
from time import sleep

os.system("""uvicorn aoai_proxy_server:app &""")
sleep(3)

INFO:     Started server process [9863]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
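
A fixed `sleep(3)` works here, but it either waits longer than needed or not long enough on a slow machine. One possible alternative, sketched below with a hypothetical helper named `wait_for_port`, is to poll the port until it accepts TCP connections. The timeout and polling interval are arbitrary defaults.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.1):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable); wait briefly and retry.
            time.sleep(interval)
    return False
```

With this helper, `sleep(3)` could be replaced by `wait_for_port('127.0.0.1', 8000)`, checking the return value before sending requests.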


To test our proxy server, we define a `chat` function that makes a chat completion request from one of our users (using their API key), prints out the response, then calls the proxy server's usage endpoint and displays the current usage stats.

In [50]:
import openai
import requests

from aoai_proxy_config import USERS

openai.api_type = 'azure'
openai.api_version = '2023-03-15-preview'
openai.api_base = 'http://127.0.0.1:8000'

deployment_id = 'gpt-35-turbo' # Replace if using a different deployment name

def chat(user, prompt):
    openai.api_key = USERS[user]['api_key']
    completion = openai.ChatCompletion.create(
        deployment_id=deployment_id,
        messages=[{'role': 'user', 'content': prompt}],
    )
    print(f'{user}: {prompt}')
    print(completion.choices[0]['message']['content'])
    print('---------------')
    usage = requests.get('http://127.0.0.1:8000/usage').json()
    print(f'Total cost: ${usage["total_cost"]:.7f}')
    for name, user_usage in usage['users'].items():
        print(f'{name} cost: ${user_usage["total_cost"]:.7f}')
    for endpoint, endpoint_usage in usage['endpoints'].items():
        print(f'{endpoint} cost: ${endpoint_usage["total_cost"]:.7f}')

Now let's make some requests and see how it behaves.

In [51]:
chat('Angela', 'What is the capital of France?')

INFO:     127.0.0.1:51042 - "POST /openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-03-15-preview HTTP/1.1" 200 OK
Angela: What is the capital of France?
The capital of France is Paris.
---------------
INFO:     127.0.0.1:51044 - "GET /usage HTTP/1.1" 200 OK
Total cost: $0.0000440
Angela cost: $0.0000440
Benjamin cost: $0.0000000
Cynthia cost: $0.0000000
David cost: $0.0000000
endpoint 1 cost: $0.0000000
endpoint 2 cost: $0.0000440


In [52]:
chat('Benjamin', 'Write a haiku about rainbows.')

INFO:     127.0.0.1:51042 - "POST /openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-03-15-preview HTTP/1.1" 200 OK
Benjamin: Write a haiku about rainbows.
Rainbow in the sky,
Colors blend in harmony,
Nature's gift of light.
---------------
INFO:     127.0.0.1:51045 - "GET /usage HTTP/1.1" 200 OK
Total cost: $0.0001100
Angela cost: $0.0000440
Benjamin cost: $0.0000660
Cynthia cost: $0.0000000
David cost: $0.0000000
endpoint 1 cost: $0.0000000
endpoint 2 cost: $0.0001100


In [53]:
chat('Cynthia', 'Count from 1 to 23.')

INFO:     127.0.0.1:51042 - "POST /openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-03-15-preview HTTP/1.1" 200 OK
Cynthia: Count from 1 to 23.
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23.
---------------
INFO:     127.0.0.1:51046 - "GET /usage HTTP/1.1" 200 OK
Total cost: $0.0002780
Angela cost: $0.0000440
Benjamin cost: $0.0000660
Cynthia cost: $0.0001680
David cost: $0.0000000
endpoint 1 cost: $0.0000000
endpoint 2 cost: $0.0002780


In [54]:
chat('David', 'What are 7 words starting with the letter "a"?')

INFO:     127.0.0.1:51042 - "POST /openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-03-15-preview HTTP/1.1" 200 OK
David: What are 7 words starting with the letter "a"?
Apple, airplane, animal, avocado, artichoke, ambulance, astronaut.
---------------
INFO:     127.0.0.1:51047 - "GET /usage HTTP/1.1" 200 OK
Total cost: $0.0003500
Angela cost: $0.0000440
Benjamin cost: $0.0000660
Cynthia cost: $0.0001680
David cost: $0.0000720
endpoint 1 cost: $0.0000000
endpoint 2 cost: $0.0003500


In [55]:
chat('Angela', 'Who was the 11th president of the US?')

INFO:     127.0.0.1:51042 - "POST /openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-03-15-preview HTTP/1.1" 200 OK
Angela: Who was the 11th president of the US?
James K. Polk.
---------------
INFO:     127.0.0.1:51049 - "GET /usage HTTP/1.1" 200 OK
Total cost: $0.0004000
Angela cost: $0.0000940
Benjamin cost: $0.0000660
Cynthia cost: $0.0001680
David cost: $0.0000720
endpoint 1 cost: $0.0000500
endpoint 2 cost: $0.0003500


In [56]:
chat('Cynthia', 'What number comes after 17?')

INFO:     127.0.0.1:51042 - "POST /openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-03-15-preview HTTP/1.1" 200 OK
Cynthia: What number comes after 17?
18.
---------------
INFO:     127.0.0.1:51051 - "GET /usage HTTP/1.1" 200 OK
Total cost: $0.0004340
Angela cost: $0.0000940
Benjamin cost: $0.0000660
Cynthia cost: $0.0002020
David cost: $0.0000720
endpoint 1 cost: $0.0000500
endpoint 2 cost: $0.0003840


As we can see, the proxy server authenticates the user, dispatches the chat completion request on their behalf to one of our available Azure OpenAI endpoints, and keeps track of the costs per user and endpoint.
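
As a quick sanity check on the bookkeeping, the per-user costs and the per-endpoint costs should each add up to the reported total. The short sketch below uses the figures from the last usage report printed above; the variable names are only illustrative.

```python
import math

# Figures copied from the final usage report printed above.
user_costs = {
    'Angela': 0.0000940,
    'Benjamin': 0.0000660,
    'Cynthia': 0.0002020,
    'David': 0.0000720,
}
endpoint_costs = {'endpoint 1': 0.0000500, 'endpoint 2': 0.0003840}
total_cost = 0.0004340

# Compare with a small tolerance, since these are floating-point sums.
users_ok = math.isclose(sum(user_costs.values()), total_cost, abs_tol=1e-9)
endpoints_ok = math.isclose(sum(endpoint_costs.values()), total_cost, abs_tol=1e-9)
print(users_ok, endpoints_ok)  # prints: True True
```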

...

Before we go, we'll clean up by killing the proxy server that is running in the background.

In [57]:
os.system("""kill $(ps aux | grep '[u]vicorn' | awk '{print $2}')""")

0

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [9863]
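
The `kill` command above stops every process whose command line matches `uvicorn`, which may include servers this notebook did not start. A more targeted option, sketched here with hypothetical helpers (`uvicorn_command`, `start_background`, `stop_background`), is to launch the server with `subprocess.Popen` and keep the handle, so that only that process is stopped.

```python
import subprocess
import sys


def uvicorn_command(app='aoai_proxy_server:app'):
    """Build the command that runs Uvicorn with the current Python interpreter."""
    return [sys.executable, '-m', 'uvicorn', app]


def start_background(command):
    """Start command as a child process and return its Popen handle."""
    return subprocess.Popen(command)


def stop_background(process, timeout=5.0):
    """Terminate the process, kill it if it does not exit in time, and return its exit code."""
    process.terminate()
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        return process.wait()
```

Usage would be `server = start_background(uvicorn_command())` at the start of the notebook and `stop_background(server)` at the end.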
