# AB testing with phospho

This notebook shows how to create a simple AB test with phospho for a chatbot built on top of the OpenAI chat completions API.

In [1]:
!pip install --upgrade phospho
!pip install python-dotenv
!pip install openai
!pip install requests



In [12]:
import os
import dotenv
import phospho
import requests
# Load the .env file
dotenv.load_dotenv()

# Check that the required environment variables are set
assert 'PHOSPHO_API_KEY' in os.environ, 'PHOSPHO_API_KEY not set'
assert 'PHOSPHO_PROJECT_ID' in os.environ, 'PHOSPHO_PROJECT_ID not set'
assert 'OPENAI_API_KEY' in os.environ, 'OPENAI_API_KEY not set'

PHOSPHO_BASE_URL = "http://127.0.0.1:8000"

In [2]:
# Create a simple agent that wraps the OpenAI chat completions API
from openai import OpenAI
client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def my_answer_generation(user_prompt: str, model_name: str) -> str:
    completion = client.chat.completions.create(
        model=model_name,  # Use the requested model rather than a hardcoded one
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    )

    return completion.choices[0].message.content

In [3]:
# Try the agent
my_answer_generation("What is the capital of France?", "gpt-3.5-turbo")

'The capital of France is Paris.'

In [7]:
# Init the phospho client
phospho.init(base_url=f"{PHOSPHO_BASE_URL}/v2")

# Update the function to add logging to phospho
def my_answer_generation(user_prompt: str, model_name: str) -> str:

    completion = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    )

    answer = completion.choices[0].message.content

    phospho.log(
        input=user_prompt,
        output=answer,
    )

    return answer

my_answer_generation("What is the capital of France?", "gpt-4")

'The capital of France is Paris.'

## Creating an AB Test

Let's now create an AB test with phospho to choose whether to use GPT 3.5 Turbo or GPT 4 for our chatbot.

In [8]:
# Create an AB test

# Define the JSON payload for the ABTestCreationRequest
payload = {
    "name": "Test A vs. Test B",
    "description": "A description of the AB test comparing variations A and B.",
    "variations": [
        {"variation_name": "GPT 3.5 turbo", "allocation_rate": 0.5, "features": {"model_name": "gpt-3.5-turbo"}},
        {"variation_name": "GPT 4", "allocation_rate": 0.5, "features": {"model_name": "gpt-4"}}
    ]
}

# Include the necessary headers, such as Authorization if needed
headers = {
    "Authorization": f"Bearer {os.environ['PHOSPHO_API_KEY']}",
    "Content-Type": "application/json"
}

# Send the POST request
url = f"{PHOSPHO_BASE_URL}/v2/projects/{os.environ['PHOSPHO_PROJECT_ID']}/ab-tests"
response = requests.post(url, json=payload, headers=headers)

# Check the response
if response.status_code == 200:
    print("AB test created successfully.")
    print("Response:", response.json())
    ab_test_id = response.json()["id"]
else:
    print("Failed to create AB test. Status code:", response.status_code)
    print("Response:", response.text)

AB test created successfully.
Response: {'id': '4368a0db440e489a9ebb5f2120b77cff', 'project_id': '6d03f091290b4719809c489b78c2344b', 'org_id': '3fe248a3-834c-4c26-8dcc-4e55112f702d', 'name': 'Test A vs. Test B', 'description': 'A description of the AB test comparing variations A and B.', 'created_at': 1707504153, 'terminated_at': None, 'status': 'started', 'variations': [{'variation_name': 'GPT 3.5 turbo', 'allocation_rate': 0.5, 'features': {'model_name': 'gpt-3.5-turbo'}}, {'variation_name': 'GPT 4', 'allocation_rate': 0.5, 'features': {'model_name': 'gpt-4'}}], 'summary': {}}
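
The selection loop later in this notebook walks the cumulated `allocation_rate` values, which only behaves as intended when the rates sum to 1. Below is a minimal sketch of a local sanity check that could be run on `payload["variations"]` before sending the request. `validate_variations` is a hypothetical helper, not part of phospho, and this notebook does not confirm whether the API enforces the same rules server-side.

```python
import math
from typing import Any, Dict, List


def validate_variations(variations: List[Dict[str, Any]]) -> None:
    """Raise ValueError if the variations cannot form a valid traffic split.

    Hypothetical local check, not a phospho function.
    """
    if not variations:
        raise ValueError("At least one variation is required")

    names = [v["variation_name"] for v in variations]
    if len(set(names)) != len(names):
        raise ValueError("Variation names must be unique")

    for v in variations:
        if not 0 <= v["allocation_rate"] <= 1:
            raise ValueError(
                f"allocation_rate out of range for {v['variation_name']}"
            )

    # Compare with a tolerance: rates such as 0.1 + 0.2 + 0.7 are not exact floats
    total = sum(v["allocation_rate"] for v in variations)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"allocation rates sum to {total}, expected 1.0")


# Same shape as the variations in the payload above
example_variations = [
    {"variation_name": "GPT 3.5 turbo", "allocation_rate": 0.5,
     "features": {"model_name": "gpt-3.5-turbo"}},
    {"variation_name": "GPT 4", "allocation_rate": 0.5,
     "features": {"model_name": "gpt-4"}},
]
validate_variations(example_variations)  # Passes silently when the split is valid
```

Failing early on the client side keeps a typo in the rates from silently skewing the traffic split.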


## Updating our code to use the features from the AB test

In [10]:
# Add the AB testing in our function
import random

# Fetch the AB test from the API
# Include the necessary headers, such as Authorization if needed
headers = {
    "Authorization": f"Bearer {os.environ['PHOSPHO_API_KEY']}",
    "Content-Type": "application/json"
}

# Send the GET request
url = f"{PHOSPHO_BASE_URL}/v2/ab-tests/{ab_test_id}"
response = requests.get(url, headers=headers)

# Check the response
if response.status_code == 200:
    ab_test = response.json()
else:
    raise ValueError(f"Failed to fetch AB test. Status code: {response.status_code}")

# Redefine the function: it takes the model to call as model_name,
# which is read from the selected variation's features at call time
def my_answer_generation(user_prompt: str, model_name: str) -> str:

    completion = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    )

    answer = completion.choices[0].message.content

    phospho.log(
        input=user_prompt,
        output=answer,
    )

    return answer


## Using our new function in AB test mode

You can now serve the two variants of your chatbot to your users and compare the results.
For the purpose of this notebook, we will just send a few example prompts through it.

In [14]:
# Create a dataset of messages to simulate the user prompts to our function in AB testing
countries = [
    "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Barbuda", "Argentina", "Armenia", "Australia", 
    "Austria", "Azerbaijan", "Bahamas", "Bahrain", "Bangladesh", "Barbados", "Belarus", "Belgium", "Belize", "Benin", 
    "Bhutan", "Bolivia", "Bosnia and Herzegovina", "Botswana", "Brazil", "Brunei", "Bulgaria", "Burkina Faso", "Burundi", 
    "Cabo Verde", "Cambodia", "Cameroon", "Canada", "Central African Republic", "Chad", "Chile", "China", "Colombia", 
    "Comoros", "Congo (Congo-Brazzaville)", "Costa Rica", "Croatia", "Cuba", "Cyprus", "Czechia (Czech Republic)", 
    "Democratic Republic of the Congo", "Denmark", "Djibouti", "Dominica", "Dominican Republic", "Ecuador", "Egypt", 
    "El Salvador", "Equatorial Guinea", "Eritrea", "Estonia", "Ethiopia", "Fiji", "Finland", 
    "France", "Gabon", "Gambia", "Georgia", "Germany", "Ghana", "Greece", "Grenada", "Guatemala", "Guinea", "Guinea-Bissau", 
    "Guyana", "Haiti", "Holy See", "Honduras", "Hungary", "Iceland", "India", "Indonesia", "Iran", "Iraq", "Ireland", 
    "Israel", "Italy", "Jamaica", "Japan", "Jordan", "Kazakhstan", "Kenya", "Kiribati", "Kuwait", "Kyrgyzstan", "Laos", 
    "Latvia", "Lebanon", "Lesotho", "Liberia", "Libya", "Liechtenstein", "Lithuania", "Luxembourg", "Madagascar", "Malawi", 
    "Malaysia", "Maldives", "Mali", "Malta", "Marshall Islands", "Mauritania", "Mauritius", "Mexico", "Micronesia", 
    "Moldova", "Monaco", "Mongolia", "Montenegro", "Morocco", "Mozambique", "Myanmar (formerly Burma)", "Namibia", "Nauru", 
    "Nepal", "Netherlands", "New Zealand", "Nicaragua", "Niger", "Nigeria", "North Korea", "North Macedonia (formerly Macedonia)", 
    "Norway", "Oman", "Pakistan", "Palau", "Palestine State", "Panama", "Papua New Guinea", "Paraguay", "Peru", 
    "Philippines", "Poland", "Portugal", "Qatar", "Romania", "Russia", "Rwanda", "Saint Kitts and Nevis", "Saint Lucia", 
    "Saint Vincent and the Grenadines", "Samoa", "San Marino", "Sao Tome and Principe", "Saudi Arabia", "Senegal", 
    "Serbia", "Seychelles", "Sierra Leone", "Singapore", "Slovakia", "Slovenia", "Solomon Islands", "Somalia", "South Africa", 
    "South Korea", "South Sudan", "Spain", "Sri Lanka", "Sudan", "Suriname", "Sweden", "Switzerland", "Syria", 
    "Taiwan", "Tajikistan", "Tanzania", "Thailand", "Timor-Leste", "Togo", "Tonga", "Trinidad and Tobago", "Tunisia", 
    "Turkey", "Turkmenistan", "Tuvalu", "Uganda", "Ukraine", "United Arab Emirates", "United Kingdom", "United States", 
    "Uruguay", "Uzbekistan", "Vanuatu", "Venezuela", "Vietnam", "Yemen", "Zambia", "Zimbabwe"
]

# Build one question per country
user_inputs = [f"What is the capital of {country}?" for country in countries]

In [16]:
for user_input in user_inputs:
    
    # This is the code snippet that will be used to select the model and serve the response
    # Get the variations from the AB test
    variations = ab_test["variations"]

    # Get a random float between 0 and 1
    random_number = random.random()

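    # Select the corresponding variation; default to the last one so that
    # floating-point rounding in the cumulated rate cannot leave model_name unset
    model_name = variations[-1]["features"]["model_name"]
    cumulated_rate = 0
    for variation in variations:
        cumulated_rate += variation["allocation_rate"]
        if random_number < cumulated_rate:
            model_name = variation["features"]["model_name"]
            break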

    # Optional: log the model name
    print(f"Using model {model_name}")

    # Now we can use the model name to call the function
    answer = my_answer_generation(user_input, model_name)

    print(answer)

Using model gpt-3.5-turbo
The capital of Afghanistan is Kabul.
Using model gpt-3.5-turbo
The capital of Albania is Tirana.
Using model gpt-4
The capital of Algeria is Algiers.
Using model gpt-3.5-turbo
The capital of Andorra is Andorra la Vella.
Using model gpt-4
The capital of Angola is Luanda.
Using model gpt-3.5-turbo
The capital of Antigua and Barbuda is St. John's.
Using model gpt-3.5-turbo
The capital of Argentina is Buenos Aires.
Using model gpt-4
The capital of Armenia is Yerevan.
Using model gpt-3.5-turbo
The capital of Australia is Canberra.
Using model gpt-4
The capital of Austria is Vienna.
Using model gpt-3.5-turbo
The capital city of Azerbaijan is Baku.
Using model gpt-3.5-turbo
The capital of the Bahamas is Nassau.
Using model gpt-3.5-turbo
The capital of Bahrain is Manama.
Using model gpt-3.5-turbo
The capital of Bangladesh is Dhaka.
Using model gpt-3.5-turbo
The capital of Barbados is Bridgetown.
Using model gpt-3.5-turbo
The capital of Belarus is Minsk.
Using model gp

KeyboardInterrupt: 
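
The cumulative-rate loop above can also be written with the standard library's `random.choices`, which draws an item according to relative weights. The sketch below assumes the AB test dictionary has the same shape as the API response shown earlier. `pick_variation` is a hypothetical helper name, not a phospho function.

```python
import random
from typing import Any, Dict, Optional


def pick_variation(
    ab_test: Dict[str, Any], rng: Optional[random.Random] = None
) -> Dict[str, Any]:
    """Return one variation, drawn with probability proportional to its allocation_rate."""
    rng = rng or random  # Fall back to the module-level generator
    variations = ab_test["variations"]
    weights = [v["allocation_rate"] for v in variations]
    # random.choices treats weights as relative, so they need not sum to exactly 1
    return rng.choices(variations, weights=weights, k=1)[0]


# Example with the same variations as the AB test created above
example_ab_test = {
    "variations": [
        {"variation_name": "GPT 3.5 turbo", "allocation_rate": 0.5,
         "features": {"model_name": "gpt-3.5-turbo"}},
        {"variation_name": "GPT 4", "allocation_rate": 0.5,
         "features": {"model_name": "gpt-4"}},
    ]
}
variation = pick_variation(example_ab_test, random.Random(0))
model_name = variation["features"]["model_name"]
```

Passing a seeded `random.Random` makes the assignment reproducible in tests, while production code can omit it. Returning the whole variation rather than only the model name keeps `variation_name` available for recording which variant served each request.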

## Get the results of the AB test

In [20]:
# Include the necessary headers, such as Authorization if needed
headers = {
    "Authorization": f"Bearer {os.environ['PHOSPHO_API_KEY']}",
    "Content-Type": "application/json"
}

# Send the GET request
url = f"{PHOSPHO_BASE_URL}/v2/projects/{os.environ['PHOSPHO_PROJECT_ID']}/ab-tests/results"
response = requests.get(url, headers=headers)

# Check the response
if response.status_code == 200:
    print("AB Test results:", response.json())
    ab_test_result = response.json()
else:
    print("Failed to get AB test results. Status code:", response.status_code)
    print("Response:", response.text)

Failed to get AB test results. Status code: 500
Response: Internal Server Error
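
The results request above returned a 500 status in this run, so no server-side comparison is available here. As a stopgap, the traffic split actually served can be checked locally against the configured allocation rates. The sketch below assumes the served model names were collected in a list during the loop. `observed_split` is a hypothetical helper, not part of phospho. The example data matches the 16 complete `Using model` lines in the truncated output above.

```python
from collections import Counter
from typing import Dict, List


def observed_split(served_models: List[str]) -> Dict[str, float]:
    """Return the share of requests served by each model."""
    counts = Counter(served_models)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# 12 requests went to gpt-3.5-turbo and 4 to gpt-4 before the interruption
served = ["gpt-3.5-turbo"] * 12 + ["gpt-4"] * 4
print(observed_split(served))  # → {'gpt-3.5-turbo': 0.75, 'gpt-4': 0.25}
```

A sample of 16 requests is too small to draw conclusions about the 50/50 allocation; the check becomes informative once the full dataset has been processed.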
