In [3]:
%pip install openai pandas --upgrade



In [4]:
import getpass
import ipywidgets as widgets
from IPython.display import display
import pandas as pd

In [5]:
# Get the openai secret key:
secret_key = getpass.getpass('Please enter your openai key: ')

Please enter your openai key: ··········


In [6]:
# Define two prompt variants: A is zero-shot, B adds two worked examples (few-shot)
prompt_A = """Product description: A pair of shoes that can fit any foot size.
Seed words: adaptable, fit, omni-fit.
Product names:"""

prompt_B = """Product description: A home milkshake maker.
Seed words: fast, healthy, compact.
Product names: HomeShaker, Fit Shaker, QuickShake, Shake Maker

Product description: A watch that can tell accurate time in space.
Seed words: astronaut, space-hardened, elliptical orbit
Product names: AstroTime, SpaceGuard, Orbit-Accurate, EliptoTime.

Product description: A pair of shoes that can fit any foot size.
Seed words: adaptable, fit, omni-fit.
Product names:"""

from openai import OpenAI

client = OpenAI(api_key=secret_key)

def get_response(prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
    )
    return response.choices[0].message.content

# Iterate through the prompts and get responses
test_prompts = [prompt_A, prompt_B]
responses = []
num_tests = 5

for idx, prompt in enumerate(test_prompts):
    # Label the variant with a letter: 0 -> 'A', 1 -> 'B'
    var_name = chr(ord('A') + idx)

    for i in range(num_tests):
        # Get a response from the model
        response = get_response(prompt)

        data = {
            "variant": var_name,
            "prompt": prompt,
            "response": response
        }
        responses.append(data)

# Convert responses into a DataFrame
df = pd.DataFrame(responses)

# Save the DataFrame as a CSV file
df.to_csv("responses.csv", index=False)

print(df)

  variant                                             prompt  \
0       A  Product description: A pair of shoes that can ...   
1       A  Product description: A pair of shoes that can ...   
2       A  Product description: A pair of shoes that can ...   
3       A  Product description: A pair of shoes that can ...   
4       A  Product description: A pair of shoes that can ...   
5       B  Product description: A home milkshake maker.\n...   
6       B  Product description: A home milkshake maker.\n...   
7       B  Product description: A home milkshake maker.\n...   
8       B  Product description: A home milkshake maker.\n...   
9       B  Product description: A home milkshake maker.\n...   

                                            response  
0  1. AdaptiFit Shoes\n2. OmniFit Footwear\n3. Pe...  
1  1. FlexiFit Shoes\n2. FitAll Footwear\n3. Omni...  
2  1. OmniFleX Shoes\n2. FlexiFit Footwear\n3. Ad...  
3  1. OmniFit Shoes\n2. AdaptaFit Footwear\n3. Si...  
4  1. OmniFit Footwe

This code defines two prompt variants (A and B), sends each one to OpenAI's `gpt-3.5-turbo` model several times, and saves the responses.

### 1. **Prompt Definitions (`prompt_A` and `prompt_B`)**:
   - **`prompt_A`**: A zero-shot prompt: the description of shoes that can fit any foot size, plus seed words that suggest key product features. It ends with `Product names:` so that the model completes it with name ideas.
   - **`prompt_B`**: A few-shot variant. It first gives two worked examples, each with a description, seed words, and product names:
     - A home milkshake maker.
     - A watch that can tell accurate time in space.

     It then ends with the same shoe description and seed words as `prompt_A`, so the only difference between the two variants is the presence of the examples.

   These prompts are stored in a list `test_prompts`.
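
Because `prompt_B` is `prompt_A` with worked examples prepended, both variants can be produced from one template. The sketch below is a hypothetical helper (`build_prompt` and the example dictionary keys are not part of the original notebook) that reproduces the `Product description / Seed words / Product names` layout used above.

```python
def build_prompt(examples, description, seed_words):
    """Hypothetical helper: format few-shot examples followed by the target product."""
    blocks = []
    for ex in examples:
        # Each worked example shows a complete description -> names pair
        blocks.append(
            f"Product description: {ex['description']}\n"
            f"Seed words: {ex['seed_words']}\n"
            f"Product names: {ex['names']}"
        )
    # The target block ends with "Product names:" so the model fills in the names
    blocks.append(
        f"Product description: {description}\n"
        f"Seed words: {seed_words}\n"
        "Product names:"
    )
    return "\n\n".join(blocks)
```

With an empty `examples` list this yields the zero-shot layout of `prompt_A`; passing the milkshake and watch examples yields the few-shot layout of `prompt_B`. Keeping the target block identical in both calls makes the examples the only thing that varies between the variants.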

### 2. **Response Fetching**:
   - **`client = OpenAI(api_key=secret_key)`**: Initializes the OpenAI client with the API key entered through `getpass` and stored in `secret_key`.
   - **`get_response(prompt)`**: Sends `prompt` to the Chat Completions API using the `gpt-3.5-turbo` model. A `"system"` message sets a generic assistant persona, and a `"user"` message carries the product-naming prompt. The function returns the text of the first choice, `response.choices[0].message.content`.
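
The test loop calls the API once per run. The Chat Completions endpoint also accepts an `n` parameter that returns several choices from a single request; the sketch below shows that alternative, assuming the same `client` object as above (`get_responses` is a hypothetical name). Taking `client` as an argument also makes it possible to exercise the function with a stand-in object instead of a live API key.

```python
def get_responses(client, prompt, n=5, model="gpt-3.5-turbo"):
    """Request n completions for one prompt in a single API call (sketch)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        n=n,  # number of choices to sample for this prompt
    )
    # One choice per requested completion
    return [choice.message.content for choice in response.choices]
```

Requesting `n` choices at once saves round trips; whether that is an acceptable substitute for five separate calls in your A/B test is a judgment call worth checking against your own setup.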

### 3. **Running the Test**:
   - **Loop over `test_prompts`**: This loop iterates through the prompt variants (A and B).
     - **`var_name = chr(ord('A') + idx)`**: Converts the index `idx` to a corresponding letter ('A' for the first variant, 'B' for the second) for labeling purposes.
   - **Nested Loop for Testing**:
     - For each variant of the prompt (`prompt_A`, `prompt_B`), the inner loop runs `num_tests` times (in this case, 5 times).
     - The **`get_response(prompt)`** function is called to get a response from the model.
     - Each result is stored in a dictionary with the variant, the prompt, and the response, and appended to the `responses` list.
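
The nested loop can also be written as a function that takes the response generator as a parameter. In this sketch, `collect_responses` and its `generate` argument are hypothetical names: pass `get_response` for real runs, or a stub such as a `lambda` to check the variant labels and row count without calling the API.

```python
import pandas as pd

def collect_responses(prompts, generate, num_tests=5):
    """Call generate(prompt) num_tests times per prompt; label variants A, B, ..."""
    rows = []
    for idx, prompt in enumerate(prompts):
        variant = chr(ord("A") + idx)  # 0 -> 'A', 1 -> 'B', ...
        for _ in range(num_tests):
            rows.append({"variant": variant, "prompt": prompt, "response": generate(prompt)})
    return pd.DataFrame(rows)
```

For example, `collect_responses(test_prompts, get_response)` would build the same kind of 10-row DataFrame as the notebook cell, with columns `variant`, `prompt`, and `response`.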

### 4. **Saving and Output**:
   - **DataFrame Creation**: The collected responses are converted into a pandas DataFrame.
   - **CSV Export**: The DataFrame is saved as a CSV file named `"responses.csv"`, which will store all responses from the model for both prompts and multiple test runs.
   - **Print the DataFrame**: Finally, the DataFrame is printed to display the results in a tabular format.

### Purpose:
The code sets up an A/B test between a zero-shot prompt (A) and a few-shot prompt (B) by calling `gpt-3.5-turbo` `num_tests` times (5) with each. The responses are saved to a CSV file so they can be rated later, showing how adding examples to the prompt affects the model's output.

### Example Responses Structure:
- Variant A: Responses generated from `prompt_A`.
- Variant B: Responses generated from `prompt_B`.

Each variant allows the model to generate potential product names based on the descriptions provided in the prompts.
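
To analyze the saved responses beyond a thumbs-up/down rating, it can help to split each response into individual names. The sketch below assumes responses are either numbered lists (as in the variant A output above) or comma-separated (the format the examples in `prompt_B` demonstrate); `parse_names` is a hypothetical helper, and real outputs may need extra cases.

```python
import re

def parse_names(response):
    """Split a model response into product names (hypothetical helper)."""
    # Case 1: numbered list such as "1. AdaptiFit Shoes\n2. OmniFit Footwear"
    numbered = []
    for line in response.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            numbered.append(match.group(1).strip())
    if numbered:
        return numbered
    # Case 2: comma-separated names such as "AstroTime, SpaceGuard, EliptoTime."
    return [name.strip().rstrip(".") for name in response.split(",") if name.strip()]
```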

In [7]:
# Load the responses.csv file:
df = pd.read_csv("responses.csv")

# Shuffle the DataFrame
df = df.sample(frac=1).reset_index(drop=True)

# 'response' is the column holding the text to rate; start at the first row
response_index = 0
df["feedback"] = pd.Series(dtype="float")  # new column for 1 (👍) / 0 (👎) feedback

response = widgets.HTML()
count_label = widgets.Label()

def update_response():
    new_response = df.iloc[response_index]["response"]
    new_response = (
        "<p>" + new_response + "</p>"
        if pd.notna(new_response)
        else "<p>No response</p>"
    )
    response.value = new_response
    count_label.value = f"Response: {response_index + 1} / {len(df)}"


def on_button_clicked(b):
    global response_index
    # Ignore extra clicks once every response has been rated
    if response_index >= len(df):
        return
    # Convert thumbs up / down to 1 / 0
    user_feedback = 1 if b.description == "👍" else 0

    # update the feedback column
    df.at[response_index, "feedback"] = user_feedback

    response_index += 1
    if response_index < len(df):
        update_response()
    else:
        # save the feedback to a CSV file
        df.to_csv("results.csv", index=False)

        print("A/B testing completed. Here are the results:")
        # Calculate score for each variant and count the number of rows per variant
        summary_df = (
            df.groupby("variant")
            .agg(count=("feedback", "count"), score=("feedback", "mean"))
            .reset_index()
        )
        print(summary_df)

This code implements an interactive feedback system for rating the model's responses in the A/B test.

### 1. **Loading and Shuffling Data**
```python
df = pd.read_csv("responses.csv")
df = df.sample(frac=1).reset_index(drop=True)
```
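- The first step reads `responses.csv` into a pandas DataFrame (`df`) and shuffles its rows with `sample(frac=1)`. The rater therefore sees responses in random order rather than all of variant A followed by all of variant B, which makes it harder to tell which variant produced a response and reduces order effects on the ratings. `reset_index(drop=True)` then renumbers the shuffled rows `0..n-1`.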
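
Two details of this step matter later. `sample(frac=1)` draws a new order on every run; passing `random_state` makes a rating session reproducible (the value `42` below is arbitrary). And `reset_index(drop=True)` is needed because the callback reads rows by position with `df.iloc[response_index]` but writes feedback by label with `df.at[response_index, "feedback"]`; the two refer to the same row only when the labels are `0..n-1`. A self-contained illustration with toy data (not the notebook's real responses):

```python
import pandas as pd

# Toy stand-in for responses.csv: 5 responses per variant
df = pd.DataFrame({
    "variant": ["A"] * 5 + ["B"] * 5,
    "response": [f"response {i}" for i in range(10)],
})

shuffled_raw = df.sample(frac=1, random_state=42)  # rows reordered, old labels kept
shuffled = shuffled_raw.reset_index(drop=True)     # labels renumbered to 0..9

# After reset_index, position 0 and label 0 are the same row
print(shuffled.iloc[0]["response"] == shuffled.at[0, "response"])  # True
```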

### 2. **Setting Up HTML Widgets**
```python
response = widgets.HTML()
count_label = widgets.Label()
```
- Two widgets are created:
  - `response` (an HTML widget) will display each AI-generated response from the DataFrame.
  - `count_label` will show the current response number out of the total number of responses.

### 3. **Updating Responses**
```python
def update_response():
    new_response = df.iloc[response_index]["response"]
    new_response = (
        "<p>" + new_response + "</p>"
        if pd.notna(new_response)
        else "<p>No response</p>"
    )
    response.value = new_response
    count_label.value = f"Response: {response_index + 1} / {len(df)}"
```
- `update_response()` reads the response at position `response_index` (initially 0) from the DataFrame, wraps it in a `<p>` tag, and shows it in the `response` HTML widget; missing values are shown as "No response".
- It also updates `count_label` to show progress through the responses (e.g., "Response: 1 / 10").
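
In HTML, a plain `\n` inside a `<p>` collapses to a space, so a numbered list like the one in the widget output below (`'<p>1. OmniFleX Shoes\n2. FlexiFit Footwear...'`) renders as a single run-on line, and any `<` or `&` in the model's text would be parsed as markup. A possible refinement, sketched as a hypothetical `format_response_html` helper rather than the notebook's code, escapes the text and turns line breaks into `<br>`:

```python
import html

import pandas as pd

def format_response_html(text):
    """Escape the model's text and keep its line breaks for a widgets.HTML value."""
    if pd.isna(text):
        return "<p>No response</p>"
    # html.escape keeps '<', '>' and '&' from being parsed as markup;
    # <br> puts each numbered name on its own line
    return "<p>" + html.escape(text).replace("\n", "<br>") + "</p>"
```

Inside `update_response()`, this would replace the conditional expression with `response.value = format_response_html(df.iloc[response_index]["response"])`.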

### 4. **Handling Button Clicks**
```python
def on_button_clicked(b):
    global response_index
    user_feedback = 1 if b.description == "👍" else 0
    df.at[response_index, "feedback"] = user_feedback

    response_index += 1
    if response_index < len(df):
        update_response()
    else:
        df.to_csv("results.csv", index=False)
        print("A/B testing completed. Here are the results:")
        summary_df = (
            df.groupby("variant")
            .agg(count=("feedback", "count"), score=("feedback", "mean"))
            .reset_index()
        )
        print(summary_df)
```
- When a user clicks one of the feedback buttons (either 👍 for positive feedback or 👎 for negative feedback):
  - It checks the button description (`b.description`). If the button is a thumbs-up, it records a `1`, otherwise a `0`, in the DataFrame's `feedback` column.
  - The `response_index` is then incremented, moving to the next response.
  - If the system has gone through all the responses, the feedback is saved to `results.csv`, and a summary of the feedback is printed. The summary includes the number of responses for each variant (`A` or `B`) and the average feedback score (where 1 represents positive feedback).
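
The callback keeps its position in a global `response_index`. One alternative, shown as a sketch rather than the notebook's method, keeps the DataFrame and the position together in a small class; `FeedbackSession` and its methods are hypothetical names. It also ignores clicks after the last response: in the original callback, a click past the end writes to a row label that does not exist, which pandas handles by adding a new row.

```python
import pandas as pd

class FeedbackSession:
    """Hypothetical helper: steps through df rows and records 1/0 feedback."""

    def __init__(self, df):
        self.df = df.copy()
        self.df["feedback"] = pd.Series(dtype="float")  # NaN until rated
        self.index = 0

    @property
    def done(self):
        return self.index >= len(self.df)

    def record(self, thumbs_up):
        """Store 1 for 👍 or 0 for 👎 on the current row; return False once finished."""
        if self.done:
            return False
        label = self.df.index[self.index]  # works even if labels are not 0..n-1
        self.df.at[label, "feedback"] = 1 if thumbs_up else 0
        self.index += 1
        return True

    def summary(self):
        return (
            self.df.groupby("variant")
            .agg(count=("feedback", "count"), score=("feedback", "mean"))
            .reset_index()
        )
```

To wire it to the buttons, a handler could call `session.record(b.description == "👍")` and then refresh the widgets; that wiring is not shown here.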

### 5. **Summary Calculation**
```python
summary_df = (
    df.groupby("variant")
    .agg(count=("feedback", "count"), score=("feedback", "mean"))
    .reset_index()
)
```
- The DataFrame is grouped by `variant` (A or B) and aggregated to calculate:
  - **`count`**: The number of rated responses for each variant (`count` skips missing feedback values).
  - **`score`**: The mean feedback, i.e., the fraction of 👍 ratings; a higher score suggests the variant's responses were preferred.

The results are then printed to show the performance of each prompt variant based on the feedback collected during the A/B test.
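
With only 5 ratings per variant, a gap in `score` can easily arise by chance. One way to check, not part of the original notebook, is a one-sided Fisher exact test on the thumbs-up counts (`round(score * count)` per variant in `summary_df`). The stdlib-only sketch below uses a hypothetical function name; `scipy.stats.fisher_exact` offers a tested implementation if SciPy is available.

```python
from math import comb

def fisher_exact_one_sided(a_up, a_n, b_up, b_n):
    """One-sided Fisher exact test: is variant B's thumbs-up rate higher than A's?

    Returns the probability that B gets at least b_up thumbs-up if the same
    total number of thumbs-up were spread over the a_n + b_n ratings at random.
    """
    total_up = a_up + b_up
    total_n = a_n + b_n
    ways_total = comb(total_n, total_up)
    p_value = 0.0
    # Sum hypergeometric probabilities of outcomes at least as extreme for B
    for k in range(b_up, min(b_n, total_up) + 1):
        p_value += comb(b_n, k) * comb(a_n, total_up - k) / ways_total
    return p_value

print(round(fisher_exact_one_sided(a_up=1, a_n=5, b_up=5, b_n=5), 4))  # 0.0238
```

For example, 1 of 5 👍 for A against 5 of 5 for B gives p = 5/210, while equal counts give a p-value above 0.5, i.e., no evidence that B is better.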

---

In [8]:
update_response()

thumbs_down_button = widgets.Button(description="👎")
thumbs_down_button.on_click(on_button_clicked)

thumbs_up_button = widgets.Button(description="👍")
thumbs_up_button.on_click(on_button_clicked)


button_box = widgets.HBox(
    [
        thumbs_up_button,
        thumbs_down_button,
    ]
)

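# Rate every response; after the last click the feedback is saved and the summary printed
# (in JupyterLab, print() output from button callbacks may not be displayed without a widgets.Output)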
display(response, button_box, count_label)

HTML(value='<p>1. OmniFleX Shoes\n2. FlexiFit Footwear\n3. AdaptaSole Sneakers\n4. VersaFit Shoes\n5. OmniStep…

HBox(children=(Button(description='👍', style=ButtonStyle()), Button(description='👎', style=ButtonStyle())))

Label(value='Response: 1 / 10')

This cell builds the interactive A/B testing interface with Jupyter widgets (`ipywidgets`).

### 1. **Update Response Function**
```python
def update_response():
    new_response = df.iloc[response_index]["response"]
    new_response = (
        "<p>" + new_response + "</p>"
        if pd.notna(new_response)
        else "<p>No response</p>"
    )
    response.value = new_response
    count_label.value = f"Response: {response_index + 1} / {len(df)}"
```
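- The cell first calls `update_response()`, defined in the previous cell, so the first response appears before any button is clicked. The function looks up the row at `response_index` in `df` and shows its text in the `response` HTML widget.
- If the response is present (`pd.notna(new_response)`), it is wrapped in a `<p>` tag; otherwise the widget shows "No response".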
- It also updates `count_label` to show which response number the user is viewing, out of the total number of responses.

### 2. **Thumbs Up and Thumbs Down Buttons**
```python
thumbs_down_button = widgets.Button(description="👎")
thumbs_down_button.on_click(on_button_clicked)

thumbs_up_button = widgets.Button(description="👍")
thumbs_up_button.on_click(on_button_clicked)
```
- Two buttons are created: one for a thumbs-down response and another for a thumbs-up response.
- The same `on_button_clicked` handler is attached to both buttons through `on_click`; it tells them apart by `b.description` (👍 or 👎), records the feedback, and moves to the next response.

### 3. **Arranging Buttons in a Horizontal Box**
```python
button_box = widgets.HBox([thumbs_up_button, thumbs_down_button])
```
- The two buttons are arranged horizontally (`HBox`) so that they appear side by side in the interface.

### 4. **Displaying the Widgets**
```python
display(response, button_box, count_label)
```
- Finally, the `response` (HTML widget), `button_box` (containing the buttons), and `count_label` are displayed in the Jupyter notebook.
- The interface shows the current response, the feedback buttons, and a label giving the current response's position out of the total (for example, `Response: 1 / 10`).

### Workflow
1. The user sees the response generated by the AI.
2. They provide feedback by clicking either the thumbs-up or thumbs-down button.
3. After each click, the next response is displayed. Once the last response is rated, the feedback is saved to `results.csv` and a per-variant summary is printed.