# Azure AI Models-as-a-Service (MaaS) Chatbot Arena

## Imports and Initialization

All libraries and packages required by this project are listed in the `requirements.txt` file.

To install them, run the following command in your terminal:

```sh
pip install -r requirements.txt
```

In [None]:
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential
import gradio as gr
import os
from datasets import load_dataset
import pandas as pd
from maas_clients import initialize_clients

## Data Loading and Pre-processing

The MMLU (Massive Multitask Language Understanding) dataset is a comprehensive benchmark designed to evaluate the performance of language models across a wide range of tasks and subjects. It includes questions from various domains such as humanities, social sciences, STEM, and more, making it a diverse and challenging dataset for assessing the generalization capabilities of language models.

We load MMLU with the `load_dataset` function from the `datasets` library and convert its `test`, `validation`, and `dev` splits to pandas DataFrames. We then extract the distinct list of subjects, which later populates the subject dropdown in the interface.

Example questions are drawn from the subject the user selects. A Gradio interface lets users enter questions, choose two models, and watch both responses stream in. It also provides controls to clear the chat history, load new examples, and adjust model parameters such as temperature and max tokens.

Used this way, MMLU supplies questions across many subjects, so the output quality of different LLMs can be compared side by side on the same prompts.
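
The notebook sends only the question text to the models. As a minimal sketch, assuming each MMLU row also carries a `choices` list and an integer `answer` index (worth verifying against the loaded dataset), a row could be rendered as a full multiple-choice prompt like this. The helper name `format_mmlu_prompt` and the sample row are hypothetical and are not part of the notebook.

```python
from string import ascii_uppercase


def format_mmlu_prompt(row: dict) -> str:
    """Render one MMLU-style row as a multiple-choice prompt.

    Assumes the row has a 'question' string and a 'choices' list.
    """
    lines = [row["question"].strip()]
    # Label each choice with a letter: A, B, C, ...
    for letter, choice in zip(ascii_uppercase, row["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)


# Hand-written row for illustration only; it is not taken from the dataset.
sample_row = {
    "question": "What is the derivative of x**2 with respect to x?",
    "subject": "high_school_mathematics",
    "choices": ["x", "2x", "x**2", "2"],
    "answer": 1,
}

prompt = format_mmlu_prompt(sample_row)
# Convert the integer answer index into the matching letter label.
expected_letter = ascii_uppercase[sample_row["answer"]]
print(prompt)
print("Expected:", expected_letter)
```

Rows loaded through `datasets` behave like dictionaries, so the same helper could be applied to `ds['test'][i]` if the column names match the assumption above.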

In [None]:
# Load the MMLU dataset from the "cais/mmlu" repository
ds = load_dataset("cais/mmlu", "all")

# Convert the 'test', 'validation', and 'dev' splits of the dataset into DataFrames
test_df = pd.DataFrame(ds['test'])
validation_df = pd.DataFrame(ds['validation'])
dev_df = pd.DataFrame(ds['dev'])

# Concatenate all splits into one combined DataFrame (used below to collect the subjects)
combined_df = pd.concat([test_df, validation_df, dev_df], ignore_index=True)

# Extract a distinct list of all subjects in the 'subject' column and store it in a variable
subjects_array = combined_df['subject'].unique()

# Convert the array of subjects to a list
subjects = subjects_array.tolist()

## Client Initialization

The `initialize_clients` function, defined in `maas_clients.py`, connects to each deployed language model and returns a dictionary of clients. Each key in the dictionary identifies one model, and the corresponding value is the client used to call it.

Keeping the connection logic in one module means the rest of the notebook only needs to look up a client by key to generate responses and compare outputs across models.
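
The contents of `maas_clients.py` are not shown in this notebook. The following is only a sketch of what such a module might look like, assuming each model's endpoint URL and API key are stored in environment variables. The variable prefixes, the `MODEL_ENV_PREFIXES` table, and the function names `make_azure_client` and `initialize_clients_sketch` are all hypothetical.

```python
import os
from typing import Callable, Mapping

# Hypothetical mapping from client key to environment-variable prefix.
# The real maas_clients.py may name its keys and variables differently.
MODEL_ENV_PREFIXES = {
    "gpt_4o_client": "GPT_4O",
    "mistral_large_client": "MISTRAL_LARGE",
}


def make_azure_client(endpoint: str, key: str):
    """Create an azure-ai-inference chat client for one endpoint."""
    # Imported lazily so the rest of the sketch runs without the SDK installed.
    from azure.ai.inference import ChatCompletionsClient
    from azure.core.credentials import AzureKeyCredential

    return ChatCompletionsClient(endpoint=endpoint, credential=AzureKeyCredential(key))


def initialize_clients_sketch(
    env: Mapping[str, str] = os.environ,
    factory: Callable[[str, str], object] = make_azure_client,
) -> dict:
    """Build one client per model whose endpoint and key are both configured."""
    clients = {}
    for client_key, prefix in MODEL_ENV_PREFIXES.items():
        endpoint = env.get(f"{prefix}_ENDPOINT")
        api_key = env.get(f"{prefix}_KEY")
        # Skip models that are not fully configured instead of failing.
        if endpoint and api_key:
            clients[client_key] = factory(endpoint, api_key)
    return clients
```

Passing `env` and `factory` as parameters is a design choice of this sketch: it allows the lookup logic to be exercised with a plain dictionary and a stand-in factory, without real credentials.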

In [None]:
clients = initialize_clients()

# Access clients like this
gpt_4o_client = clients["gpt_4o_client"]
gpt_4_turbo_client = clients["gpt_4_turbo_client"]
jamba_instruct_client = clients["jamba_instruct_client"]
command_r_client = clients["command_r_client"]
command_r_plus_client = clients["command_r_plus_client"]
jais_30b_client = clients["jais_30b_client"]
llama_3_1_405B_client = clients["llama_3_1_405B_client"]
llama_3_1_70B_client = clients["llama_3_1_70B_client"]
llama_3_1_8B_client = clients["llama_3_1_8B_client"]
mistral_large_client = clients["mistral_large_client"]
mistral_large_2407_client = clients["mistral_large_2407_client"]
mistral_nemo_client = clients["mistral_nemo_client"]
phi_3_medium_128k_instruct_client = clients["phi_3_medium_128k_instruct_client"]
phi_3_small_128k_instruct_client = clients["phi_3_small_128k_instruct_client"]
phi_3_mini_4k_instruct_client = clients["phi_3_mini_4k_instruct_client"]

## CSS and HTML
This section defines the CSS used to style the Azure AI Models-as-a-Service (MaaS) Arena interface and the HTML title displayed at the top of the page.

In [None]:
css = """
h1 {
  margin: 0;
  flex-grow: 1;
  font-size: 24px;
  min-width: 200px;
}
"""

title = """<h1 style="text-align: center;">Welcome to the Azure AI Models-as-a-Service (MaaS) Arena</h1>"""

## Inference and model selection functions
This section of the notebook provides functions for handling inference and model selection for chatbot interactions. The key components include:

1. `user_model` Function:
    - This function takes a user message and the current chat history as inputs.
    - It returns an empty string, which clears the input textbox, and a new chat history with the user message appended and `None` as a placeholder for the assistant's response.
2. `chat_model_1` and `chat_model_2` Functions:
    - Both functions are identical generators, one per chatbot panel, that produce the chatbot's response.
    - They take the chat history, temperature, maximum tokens, and model name as inputs.
    - They select the appropriate model client using the `select_model` function.
    - They convert the chat history into a format suitable for the model.
    - They generate responses from the model in a streaming fashion, updating the chat history with the assistant's responses.
3. `select_model` Function:
    - This function maps the provided model name to the corresponding model client.
    - It supports various models such as GPT-4o, GPT-4 Turbo, AI21 Jamba-Instruct, and others.
    - If an unknown model name is provided, it raises a `ValueError`.

These functions collectively enable the notebook to perform dynamic model selection and generate responses for chatbot interactions, facilitating experimentation with different models and configurations.
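
Because `chat_model_1` and `chat_model_2` share the same logic, the history conversion and the streaming loop can be illustrated once. The sketch below is not the notebook's code: `build_messages`, `stream_reply`, and `FakeClient` are hypothetical names, and `FakeClient` only imitates the shape of a streamed response (objects with `choices[0].delta.content`) so that the example runs without credentials. Unlike the notebook's loop, it also skips empty assistant replies, not just `None`.

```python
from types import SimpleNamespace


def build_messages(chat_history, system_prompt="You are a helpful assistant."):
    """Convert [[user, assistant], ...] pairs into role/content dictionaries."""
    messages = [{"role": "system", "content": system_prompt}]
    for user, assistant in chat_history:
        messages.append({"role": "user", "content": user})
        # The newest turn has no assistant reply yet (None or ""), so skip it.
        if assistant:
            messages.append({"role": "assistant", "content": assistant})
    return messages


def stream_reply(client, chat_history, temp, max_tokens):
    """Yield the history after each streamed chunk, as the Gradio handlers do."""
    messages = build_messages(chat_history)
    chat_history[-1][1] = ""
    response = client.complete(
        stream=True, messages=messages, temperature=temp, max_tokens=max_tokens
    )
    for update in response:
        # Some chunks carry no choices or no text; ignore those.
        if update.choices and update.choices[0].delta.content:
            chat_history[-1][1] += update.choices[0].delta.content
            yield chat_history


class FakeClient:
    """Stand-in for a real client; it streams a fixed reply in two chunks."""

    def complete(self, **kwargs):
        for text in ("Hello", ", world"):
            delta = SimpleNamespace(content=text)
            yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


history = [["Hi", None]]
for snapshot in stream_reply(FakeClient(), history, temp=0.7, max_tokens=64):
    pass  # In Gradio, each yielded snapshot would refresh the chatbot panel.
print(history)
```

Each `yield` hands the partially filled history back to the caller, which is what lets the chatbot panel update while the model is still generating.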

In [None]:
# Inference functions for Chatbots 
def user_model(user_message, chat_history):
    return "", chat_history + [[user_message, None]]

# Stream a response from the selected model into the first chatbot panel
def chat_model_1(chat_history, temp, max_tokens, model_name):
    selected_client = select_model(model_name)
    chat_history[-1][1] = ""

    # Convert chat history to message dictionaries
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user, assistant in chat_history[:-1]:
        messages.append({"role": "user", "content": user})
        if assistant is not None:
            messages.append({"role": "assistant", "content": assistant})
    messages.append({"role": "user", "content": chat_history[-1][0]})

    response = selected_client.complete(
        stream=True,
        messages=messages,
        temperature=temp,
        max_tokens=max_tokens
    )

    for update in response:
        if update.choices and update.choices[0].delta.content is not None:
            chat_history[-1][1] += update.choices[0].delta.content or ""
            yield chat_history

    yield chat_history

# Stream a response from the selected model into the second chatbot panel (same logic as chat_model_1)
def chat_model_2(chat_history, temp, max_tokens, model_name):
    selected_client = select_model(model_name)
    chat_history[-1][1] = ""

    # Convert chat history to message dictionaries
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user, assistant in chat_history[:-1]:
        messages.append({"role": "user", "content": user})
        if assistant is not None:
            messages.append({"role": "assistant", "content": assistant})
    messages.append({"role": "user", "content": chat_history[-1][0]})

    response = selected_client.complete(
        stream=True,
        messages=messages,
        temperature=temp,
        max_tokens=max_tokens
    )

    for update in response:
        if update.choices and update.choices[0].delta.content is not None:
            chat_history[-1][1] += update.choices[0].delta.content or ""
            yield chat_history

    yield chat_history

def select_model(model_name):
    if model_name == "GPT-4o":
        return gpt_4o_client
    elif model_name == "GPT-4 Turbo":
        return gpt_4_turbo_client
    elif model_name == "AI21 Jamba-Instruct":
        return jamba_instruct_client
    elif model_name == "Cohere Command R":
        return command_r_client
    elif model_name == "Cohere Command R+":
        return command_r_plus_client
    elif model_name == "Jais 30B":
        return jais_30b_client
    elif model_name == "Llama3.1 405B":
        return llama_3_1_405B_client
    elif model_name == "Llama3.1 70B":
        return llama_3_1_70B_client
    elif model_name == "Llama3.1 8B":
        return llama_3_1_8B_client
    elif model_name == "Mistral-Large":
        return mistral_large_client
    elif model_name == "Mistral-Large-2407":
        return mistral_large_2407_client
    elif model_name == "Mistral-Nemo":
        return mistral_nemo_client
    elif model_name == "Phi-3-medium-128k":
        return phi_3_medium_128k_instruct_client
    elif model_name == "Phi-3-small-128k":
        return phi_3_small_128k_instruct_client
    elif model_name == "Phi-3-mini-4k":
        return phi_3_mini_4k_instruct_client
    else:
        raise ValueError(f"Unknown model name: {model_name}")
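
The `if`/`elif` chain above can also be expressed as a dictionary lookup. This is only an alternative sketch, not the notebook's implementation: `MODEL_CLIENT_KEYS` and `select_model_from` are hypothetical names, and only a few models are listed to keep the example short.

```python
# Display name shown in the dropdown -> key in the clients dictionary.
# Only a few entries are listed; the remaining models follow the same pattern.
MODEL_CLIENT_KEYS = {
    "GPT-4o": "gpt_4o_client",
    "Llama3.1 405B": "llama_3_1_405B_client",
    "Mistral-Large": "mistral_large_client",
    "Phi-3-mini-4k": "phi_3_mini_4k_instruct_client",
}


def select_model_from(clients, model_name):
    """Return the client registered for a dropdown display name."""
    try:
        return clients[MODEL_CLIENT_KEYS[model_name]]
    except KeyError:
        # Raised for an unknown display name or a client that was not configured.
        raise ValueError(f"Unknown model name: {model_name}") from None


# Placeholder strings stand in for real clients in this illustration.
demo_clients = {key: f"<{key}>" for key in MODEL_CLIENT_KEYS.values()}
print(select_model_from(demo_clients, "Mistral-Large"))
```

With a table like this, the dropdown choices could be derived from `MODEL_CLIENT_KEYS.keys()`, so adding a model would mean adding one entry in one place.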

## MMLU Example Handling Functions and Gradio Interface

This section of the notebook sets up an interactive Gradio interface for experimenting with different chatbot models and configurations. The key components include:

1. **Dropdowns for Model Selection**:
    - Two dropdowns (`model_dropdown1` and `model_dropdown2`) allow users to select different models for comparison.
    - The available models include GPT-4o, GPT-4 Turbo, AI21 Jamba-Instruct, Cohere Command R, and others.

2. **Chatbot Interfaces**:
    - Two chatbot interfaces (`chatbot1` and `chatbot2`) display the conversation history for the selected models.
    - Each chatbot has a placeholder indicating no messages yet and a label showing the selected model.

3. **User Input and Control Buttons**:
    - A textbox (`user_msg`) for users to input their messages.
    - A submit button (`submit_button`) to send the message.
    - A clear button (`clear_button`) to clear the chat history.

4. **Additional Parameters**:
    - An accordion (`additional_inputs_accordion`) for additional settings like temperature and maximum tokens.
    - A slider (`temperature`) to adjust the temperature of the model.
    - A slider (`max_tokens`) to set the maximum number of tokens for the model's response.

5. **Subject and Example Generation**:
    - A dropdown (`subject`) to select a subject for generating example questions.
    - Buttons (`generate`, `random_button`, `clear`) to generate examples, generate random examples, and revert to cached examples, respectively.
    - An examples component (`examples`) to display and select example questions.

6. **Event Handling**:
    - The `generate` button updates the examples based on the selected subject.
    - The `random_button` generates random examples based on the selected subject.
    - The `clear` button reverts to the cached examples.
    - Clicking on an example updates the user message textbox.

7. **Model Selection and Chatbot Update**:
    - Selecting a model from the dropdown updates the corresponding chatbot's label.
    - The `user_msg` textbox and `submit_button` handle user input and trigger the chatbot response generation using the selected model.

8. **Launching the Interface**:
    - The Gradio interface is launched with `demo.launch(debug=True)` to enable debugging and interaction.

This setup allows users to interactively test and compare different chatbot models, adjust parameters, and generate example questions for various subjects.
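
The example-generation step (filter by subject, pick up to five questions, wrap each one in a list) can be sketched without pandas or Gradio. The function `pick_examples` and the sample rows below are hypothetical. The sketch also caps the sample size at the number of available questions, since sampling more items than exist raises an error.

```python
import random


def pick_examples(rows, subject, n=5, seed=None):
    """Return up to n questions for a subject, each wrapped in a one-item list.

    rows is any iterable of dicts with 'subject' and 'question' keys.
    Passing a seed makes the selection repeatable.
    """
    questions = [row["question"] for row in rows if row["subject"] == subject]
    rng = random.Random(seed)
    # Never ask for more questions than the subject actually has.
    chosen = rng.sample(questions, k=min(n, len(questions)))
    return [[question] for question in chosen]


# Hand-written rows for illustration only.
rows = [
    {"subject": "astronomy", "question": f"Astronomy question {i}?"} for i in range(3)
] + [{"subject": "virology", "question": "Virology question 0?"}]

print(pick_examples(rows, "astronomy", n=5, seed=0))
```

The one-item lists match the shape the examples component expects: one inner list per example, with one value per input component. In pandas, the equivalent safeguard would be `sample(n=min(5, len(subject_df)), random_state=seed)`.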

In [None]:
def update_chatbot(select_data):
    return gr.update(label=f"{select_data}")

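def update_max_tokens(model_name, current_value):
    # Cap the slider at 2048 tokens for Phi-3-mini-4k and at 4096 for other models,
    # clamping the slider's current value to the new maximum.
    # Note: this helper is not attached to any event below; to use it, pass a model
    # dropdown and the max_tokens slider as inputs and the slider as the output.
    limit = 2048 if model_name == "Phi-3-mini-4k" else 4096
    return gr.update(maximum=limit, value=min(limit, current_value))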
    
def generate_subject_examples(subject):
    # Filter the DataFrame to only include rows where the subject matches the input
    subject_df = test_df[test_df['subject'] == subject]
    # Convert the selected rows to a list of lists
    question_list = [[question] for question in subject_df['question'].head().values.tolist()]
    # Return the list of lists
    return question_list

def generate_subject_random_examples(subject):
    # Filter the DataFrame to only include rows where the subject matches the input
    subject_df = test_df[test_df['subject'] == subject]
    # Draw a random sample of up to five questions (pass random_state to sample() for reproducibility)
    question_list = subject_df['question'].sample(n=min(5, len(subject_df))).tolist()
    # Convert the selected rows to a list of lists
    question_list = [[question] for question in question_list]
    # Return the list of lists
    return question_list

cached_examples = [
    ["There's a llama in my garden 😱 What should I do?"],
    ["What is the best way to open a can of worms?"],
    ["The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1."],
    ['How to setup a human base on Mars? Give short answer.'],
    ['Explain theory of relativity to me like I’m 8 years old.'],
    ['What is 9,000 * 9,000?'],
    ['Write a pun-filled happy birthday message to my friend Alex.'],
    ['Justify why a penguin might make a good king of the jungle.'],
    ['Give me 5 good reasons why I should exercise every day.'],
    ['How many languages are in the world?']
]

# Function to handle example click
def display_example(example):
    # The dataset created by gr.Examples is expected to pass (index, sample) here;
    # return the first field of the sample, which is the question text
    return example[1][0]

# Function to update examples
def update_examples(subject):
    new_examples = generate_subject_examples(subject)
    examples.dataset.samples = new_examples
    return gr.Dataset(samples=new_examples)

# Function to update random examples
def update_random_examples(subject):
    new_examples = generate_subject_random_examples(subject)
    examples.dataset.samples = new_examples
    return gr.Dataset(samples=new_examples)

# Function to revert to cached examples
def revert_to_cached_examples():
    examples.dataset.samples = cached_examples
    return gr.Dataset(samples=cached_examples)

# Gradio Blocks
with gr.Blocks(css=css) as demo:
    gr.HTML(title)

    with gr.Row():
        model_dropdown1 = gr.Dropdown(
            choices=[
                "GPT-4o", "GPT-4 Turbo", "AI21 Jamba-Instruct", "Cohere Command R", "Cohere Command R+", 
                "Jais 30B", "Llama3.1 405B", "Llama3.1 70B", "Llama3.1 8B", "Mistral-Large", 
                "Mistral-Large-2407", "Mistral-Nemo", "Phi-3-medium-128k", "Phi-3-small-128k", "Phi-3-mini-4k"
            ],
            label="Model A",
            value="Llama3.1 405B"
        )
        model_dropdown2 = gr.Dropdown(
            choices=[
                "GPT-4o", "GPT-4 Turbo", "AI21 Jamba-Instruct", "Cohere Command R", "Cohere Command R+", 
                "Jais 30B", "Llama3.1 405B", "Llama3.1 70B", "Llama3.1 8B", "Mistral-Large", 
                "Mistral-Large-2407", "Mistral-Nemo", "Phi-3-medium-128k", "Phi-3-small-128k", "Phi-3-mini-4k"
            ],
            label="Model B",
            value="Mistral-Large"
        )

    with gr.Row():
        chatbot1 = gr.Chatbot(
            placeholder="No messages yet", 
            label="Llama3.1 405B"
        )
        chatbot2 = gr.Chatbot(
            placeholder="No messages yet", 
            label="Mistral-Large"
        )

    with gr.Row():
        user_msg = gr.Textbox(placeholder="Ask me anything", label="User Messages", scale=7)
        submit_button = gr.Button("Send", variant="primary")   
        clear_button = gr.Button("Clear")

    additional_inputs_accordion = gr.Accordion(label="⚙️ Parameters", open=True)
    with additional_inputs_accordion:
        temperature = gr.Slider(minimum=0, maximum=1, step=0.1, value=0.90, label="Temperature")
        max_tokens = gr.Slider(minimum=128, maximum=4096, step=128, value=2048, label="Max tokens")

        # Row for subject and generate button
        with gr.Row():
            with gr.Column():
                with gr.Row():
                    # Subject dropdown and generate button
                    subject = gr.Dropdown(choices=subjects, label="Subject")
                with gr.Row():
                    generate = gr.Button(value="Generate", variant="secondary")
                    random_button = gr.Button(value="Random", variant='secondary')
                    clear = gr.Button(value="Clear", variant="secondary")
            with gr.Column(scale=3):
                # Examples component
                examples = gr.Examples(examples=cached_examples, inputs=user_msg, label="Examples")           

        # Update the examples when a new subject is selected and the generate button is clicked
        generate.click(fn=update_examples, inputs=[subject], outputs=examples.dataset)
        
        # Update the examples when the random button is clicked
        random_button.click(fn=update_random_examples, inputs=[subject], outputs=examples.dataset)

        # Revert to cached examples when clear button is clicked
        clear.click(fn=revert_to_cached_examples, inputs=None, outputs=examples.dataset)
        
        # Set up the example click event to update the output textbox
        examples.dataset.click(fn=display_example, inputs=examples.dataset, outputs=user_msg)
            
    model_dropdown1.select(
        fn=update_chatbot,
        inputs=[model_dropdown1], 
        outputs=[chatbot1],
        scroll_to_output=True,
        show_progress="minimal"
    )

    model_dropdown2.select(
        fn=update_chatbot,
        inputs=[model_dropdown2], 
        outputs=[chatbot2],
        scroll_to_output=True,
        show_progress="minimal"
    )

    # handle the case where the user presses enter instead of clicking the submit button

    user_msg.submit(user_model, [user_msg, chatbot1], [user_msg, chatbot1], queue=False).then(
        chat_model_1, [chatbot1, temperature, max_tokens, model_dropdown1], [chatbot1]
    )
    
    user_msg.submit(user_model, [user_msg, chatbot2], [user_msg, chatbot2], queue=False).then(
        chat_model_2, [chatbot2, temperature, max_tokens, model_dropdown2], [chatbot2]
    )

    submit_button.click(user_model, [user_msg, chatbot1], [user_msg, chatbot1], queue=False).then(
        chat_model_1, [chatbot1, temperature, max_tokens, model_dropdown1], [chatbot1]
    )

    submit_button.click(user_model, [user_msg, chatbot2], [user_msg, chatbot2], queue=False).then(
        chat_model_2, [chatbot2, temperature, max_tokens, model_dropdown2], [chatbot2]
    )

    clear_button.click(lambda: None, None, chatbot1, queue=False)
    clear_button.click(lambda: None, None, chatbot2, queue=False)

if __name__ == "__main__":
    demo.launch(debug=True)

## Conclusion

In this notebook, we built an interactive Gradio interface for comparing chatbot models side by side. The setup allows users to:

- Select and compare different chatbot models.
- Adjust parameters such as temperature and maximum tokens to fine-tune the model responses.
- Generate and test example questions across different subjects.

Comparing responses to the same prompt shows how each model behaves, which makes it easier to judge the capabilities and limitations of each one and to choose the model best suited to a specific need.