# Mixture of Agents - AI Makerspace Event

In today's notebook we're going to be exploring Together AI's [Mixture of Agents](https://arxiv.org/abs/2406.04692).

### What is Mixture-of-Agents?

In broad strokes, Mixture-of-Agents (MoA) can be reduced to the following simple idea:

> Considering multiple responses when generating a final response tends to lead to a higher quality response.

### How does it work?

For each MoA system there exist three main components:

1. Proposer (or Reference) models - these models propose responses to a user's prompt.
2. Aggregators - these models take the proposed outputs, along with a system prompt and the user's prompt, and synthesize a response.
3. Layers - each MoA system has a number of layers that corresponds to the number of proposal stages and aggregation stages.

Let's look at a diagram to better understand layers:


![image](https://i.imgur.com/OYRhF8R.png)

The basic idea is that each layer accepts the set of responses generated by the previous layer, and then passes its own set of responses on to the next layer. This shares similarities with how Deep Learning networks are constructed, albeit at a more "zoomed out" level.
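
To make the layer-to-layer flow concrete, here is a minimal offline sketch. The "models" are stand-in Python functions that make no API calls, and the names `make_stub_model` and `run_layers` are hypothetical, chosen for illustration rather than taken from Together AI's code.

```python
def make_stub_model(name):
    """Return a fake 'model' that labels its output and reports how many references it saw."""
    def model(prompt, references):
        return f"{name} answer to {prompt!r} (saw {len(references)} references)"
    return model


def run_layers(prompt, proposer_layers, aggregator):
    """Pass the response set from each layer to the next, then aggregate."""
    references = []  # the first layer sees only the user's prompt
    for layer in proposer_layers:
        # every model in the layer receives the previous layer's full response set
        references = [model(prompt, references) for model in layer]
    return aggregator(prompt, references)


layers = [
    [make_stub_model("A1"), make_stub_model("A2"), make_stub_model("A3")],
    [make_stub_model("B1"), make_stub_model("B2"), make_stub_model("B3")],
]
final = run_layers("hi", layers, make_stub_model("Aggregator"))
print(final)  # Aggregator answer to 'hi' (saw 3 references)
```

Note that `references` is overwritten on every pass, so each layer only ever sees the output of the layer directly before it.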

Let's dive into the notebook to see how this works in practice!

## Dependencies and API Keys

We'll be using Together AI's platform today - which will require an API key to get started!

You can follow the process outlined at step 1 [here](https://docs.together.ai/docs/quickstart#1-register-for-an-account) to obtain an API key.

> NOTE: This notebook can be executed with the free \$5 given to new Together AI accounts. It will consume roughly \$0.01 in credits total. Details about pricing are available [here](https://www.together.ai/pricing).

In [None]:
import os
import getpass

os.environ["TOGETHER_API_KEY"] = getpass.getpass("Together API Key:")

Together API Key:··········


Next, we'll install our necessary dependencies.

In [None]:
!pip install -qU together openai pyarrow

We need to run some `asyncio` boilerplate to ensure the notebook can run.

In [None]:
import nest_asyncio
nest_asyncio.apply()

We will also need to make sure we have set up our Together AI client so that we can interact with the LLMs hosted through their service.

In [None]:
import asyncio
import os
import together
from together import AsyncTogether, Together

client = Together(api_key=os.environ.get("TOGETHER_API_KEY"))
async_client = AsyncTogether(api_key=os.environ.get("TOGETHER_API_KEY"))

## Setting up Simple MoA

In the following section of the notebook, we'll set up a simple example of the MoA system following Together AI's example from their [GitHub repository](https://github.com/togethercomputer/MoA/blob/main/moa.py).

This will be a two-layer example - with a proposal layer and an aggregation layer.

Let's check out the diagram:

![image](https://i.imgur.com/Ko3Jsil.png)

So we'll need to build out the functionality for our MoA system as follows:

1. The first layer, which should accept a user prompt, and respond with each Proposal LLM's response to that user prompt.
2. A step that concatenates those responses.
3. A second layer which augments the system prompt with the outputs of Layer #1 and returns the final Aggregate LLM's response.

### Proposal Models

We can select any number of models here, though the cost of each run grows with the number of models in each of our layers.

You can find a detailed list of available models [here](https://docs.together.ai/docs/chat-models).

> NOTE: Through their research, Together AI noticed that certain models excel in certain roles - which is something to keep in mind when constructing a MoA. Experimentation is encouraged!

In [None]:
proposal_models = [
    "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    "google/gemma-2-27b-it",
    "zero-one-ai/Yi-34B-Chat",
    "deepseek-ai/deepseek-llm-67b-chat"
]

### Aggregator Layer

We'll use a moderately powerful model for our Aggregator, given the task of synthesizing a high quality response.

In [None]:
aggregator_model = "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"

You'll notice that in the aggregation layer we have a system prompt - this prompt is what provides the instructions to synthesize the previous responses.

> NOTE: This prompt is not "set in stone" and can, and should, be experimented with - especially depending on the Aggregator used.

In [None]:
aggreagator_system_prompt = """\
You have been provided with a set of responses from various open-source models to the latest user query. \
Your task is to synthesize these responses into a single, high-quality response. \
It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. \
Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. \
Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.

Responses from models:"""

### Model Calling Helper Function

We then build a helper function to call the Together API.

We'll be using good netiquette and making sure we use exponential backoff for our API calls in the case that we're rate limited.

Also - you'll notice that we're providing a few values that could be changed or experimented with:

- `temperature` - while the example uses a moderately low temperature, you could experiment with higher or lower to see the overall impact on the final response.
- `max_tokens` - we're setting the max response tokens to a reasonable size for this example, though you could try a larger or smaller value to see how it impacts the final response.

> NOTE: We are using `run_llm` to run *both* the proposal models and the aggregator model - but it might be prudent or useful to separate the two based on your preferences.

In [None]:
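async def run_llm(model, user_prompt):
    """Run a single LLM call with a reference model, backing off when rate limited."""
    response = None
    # Try up to three times, waiting longer after each rate-limited attempt.
    for sleep_time in [1, 2, 4]:
        try:
            response = await async_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": user_prompt}],
                temperature=0.7,
                max_tokens=512,
            )
            break
        except together.error.RateLimitError as e:
            print(e)
            await asyncio.sleep(sleep_time)
    # Without this check, exhausting every retry would fail with an unbound `response`.
    if response is None:
        raise RuntimeError(f"Rate limit retries exhausted for {model}")
    return response.choices[0].message.content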
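
The retry loop in `run_llm` can be exercised without network access. The sketch below is a simplified, offline illustration: `FakeRateLimitError`, `flaky_call`, and `call_with_backoff` are hypothetical names (none of them come from the Together SDK), and the delays are scaled down so the example finishes quickly.

```python
import asyncio


class FakeRateLimitError(Exception):
    """Stand-in for a rate limit error; not part of the Together SDK."""


attempts = {"count": 0}


async def flaky_call():
    """Fail twice with a fake rate limit error, then succeed."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("Error code: 429 (simulated)")
    return "ok"


async def call_with_backoff(func, delays=(0.01, 0.02, 0.04)):
    """Try func once per delay, sleeping for that delay after each rate-limited attempt."""
    last_error = None
    for delay in delays:
        try:
            return await func()
        except FakeRateLimitError as error:
            last_error = error
            await asyncio.sleep(delay)
    # every attempt was rate limited, so surface the last error instead of failing silently
    raise last_error


result = asyncio.run(call_with_backoff(flaky_call))
print(result, attempts["count"])  # ok 3
```

Doubling the delay after each failure gives the service progressively more time to recover, which is why the notebook's helper sleeps for 1, 2, and then 4 seconds.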

### Constructing MoA

Now we can construct our two layer MoA!

> NOTE: While we've been using words like "layers" - there isn't yet a specific implementation or framework that abstracts the layers in a familiar pattern like PyTorch, or otherwise, so this is more or less just a way to think about the construction of the MoA system for now.

In [None]:
async def two_layer_moa(user_prompt):
    # Layer #1: Run each Proposal Model and collect the results.
    results = await asyncio.gather(*[run_llm(model, user_prompt) for model in proposal_models])

    # Intermediate Step: Concatenate the results from Layer #1 into a numbered list to be used as input for Layer #2.
    concatenated_results = "\n".join([f"{i+1}. {str(element)}" for i, element in enumerate(results)])

    # Layer #2: Append the concatenated results to the aggregator system prompt, and call the aggregator model.
    final_stream = client.chat.completions.create(
        model=aggregator_model,
        messages=[
            {"role": "system", "content": aggreagator_system_prompt + "\n" + concatenated_results},
            {"role": "user", "content": user_prompt},
        ],
        stream=True,
    )

    # Stream the output
    for chunk in final_stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
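
The `or ""` in the streaming loop is easy to miss. The sketch below mimics the chunk shape used above (`chunk.choices[0].delta.content`) with plain `SimpleNamespace` objects. It assumes, as is common in OpenAI-style streaming APIs, that some chunks may carry `None` instead of text. `fake_chunk` and `fake_stream` are hypothetical helpers, not SDK objects.

```python
from types import SimpleNamespace


def fake_chunk(text):
    """Build an object shaped like a streaming chunk: chunk.choices[0].delta.content."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


# Hypothetical stream: the last chunk has no text, which is the case `or ""` protects against.
fake_stream = [fake_chunk("The "), fake_chunk("answer"), fake_chunk(None)]

pieces = []
for chunk in fake_stream:
    # fall back to an empty string so a None chunk is never joined or printed as "None"
    pieces.append(chunk.choices[0].delta.content or "")

streamed_text = "".join(pieces)
print(streamed_text)  # The answer
```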

Nothing left to do but run our `two_layer_moa` now!

> NOTE: You may see `Error code: 429 ...` in the output. This is expected behaviour! We built a backoff into our `run_llm` helper function to make sure we're using the Together AI API responsibly.

In [None]:
user_prompt = "What is the meaning of life?"

asyncio.run(two_layer_moa(user_prompt))

Error code: 429 - {"message": "You have been rate limited. Your rate limit is 60 queries per minute. Please navigate to https://api.together.xyz/settings/billing to upgrade to a paid plan.", "type_": "credit_limit"}
Error code: 429 - {"message": "You have been rate limited. Your rate limit is 60 queries per minute. Please navigate to https://api.together.xyz/settings/billing to upgrade to a paid plan.", "type_": "credit_limit"}
Error code: 429 - {"message": "You have been rate limited. Your rate limit is 60 queries per minute. Please navigate to https://api.together.xyz/settings/billing to upgrade to a paid plan.", "type_": "credit_limit"}
Error code: 429 - {"message": "You have been rate limited. Your rate limit is 60 queries per minute. Please navigate to https://api.together.xyz/settings/billing to upgrade to a paid plan.", "type_": "credit_limit"}
The question of the meaning of life is one of the most profound and elusive inquiries humanity has ever pondered. It has been debated by

## Setting Up Multi-Layer Proposal MoA

Now that we've constructed the simple MoA system, let's extend it to a multi-layer proposal MoA.

In essence, this will be the same process - but we will have the ability to include more layers.
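
Adding layers also adds requests, since every proposal layer calls every proposal model. As a rough sketch, assuming each proposal layer uses the same list of models and there is exactly one aggregator call at the end, the request count can be estimated as follows (`count_llm_calls` is a hypothetical helper for illustration):

```python
def count_llm_calls(num_proposers, layers):
    """Count requests for an MoA with (layers - 1) proposal layers and one aggregation layer."""
    if layers < 2:
        raise ValueError("this sketch assumes at least one proposal layer and one aggregation layer")
    proposal_calls = num_proposers * (layers - 1)
    aggregator_calls = 1
    return proposal_calls + aggregator_calls


for layers in (2, 3, 4):
    print(layers, count_llm_calls(num_proposers=4, layers=layers))
# 2 5
# 3 9
# 4 13
```

With the four proposal models used in this notebook, each extra layer costs four more requests.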

In the following example - we'll look at constructing the original system as shown here:

![image](https://i.imgur.com/OYRhF8R.png)

Let's dig right in!

### Intermediate State

Because we must collect the previous layer's responses and pass them, along with the user query, to the next layer - we'll build a simple helper function to accomplish this.

This helper function takes the aggregator system prompt and appends the previous layer's responses to it as a numbered list, producing a single system prompt.

> NOTE: You might notice that *technically* the aggregation prompt is being included in each proposal model's prompt past the first layer - and this is correct. However, even though aggregation is *occurring* in Layers #2+, the responses are still only proposals until the final aggregation step occurs.

In [None]:
def collect_responses(system_prompt, results):
    """Construct a system prompt for layers 2+ that includes the previous responses to synthesize."""
    return (
        system_prompt
        + "\n"
        + "\n".join([f"{i+1}. {str(element)}" for i, element in enumerate(results)])
    )
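
To see the shape of the prompt this helper produces, here is a small standalone example. The helper is repeated so the snippet runs on its own, and the system prompt and responses are shortened, hypothetical placeholders.

```python
def collect_responses(system_prompt, results):
    """Same logic as the helper above, repeated so this example runs on its own."""
    return (
        system_prompt
        + "\n"
        + "\n".join([f"{i+1}. {str(element)}" for i, element in enumerate(results)])
    )


# Hypothetical, shortened inputs for illustration only.
example_system_prompt = "Synthesize these responses.\n\nResponses from models:"
example_results = ["Answer from model A", "Answer from model B"]

prompt = collect_responses(example_system_prompt, example_results)
print(prompt)
# Synthesize these responses.
#
# Responses from models:
# 1. Answer from model A
# 2. Answer from model B
```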

### Model Calling Helper Function

Once again, we're going to create a helper function to call our Together AI API hosted models.

This time, however, we will need to accommodate the fact that there might be previous responses to consider. If there are, we will leverage `collect_responses` to build a system prompt - if not, we will just use the user prompt.

In [None]:
async def run_llm(model, user_prompt, prev_response=None):
    """Run a single LLM call with a model while accounting for previous responses + rate limits."""
    response = None
    for sleep_time in [1, 2, 4]:
        try:
            # Layers 2+ receive the previous responses through the system prompt.
            messages = (
                [
                    {
                        "role": "system",
                        "content": collect_responses(
                            aggreagator_system_prompt, prev_response
                        ),
                    },
                    {"role": "user", "content": user_prompt},
                ]
                if prev_response
                else [{"role": "user", "content": user_prompt}]
            )
            response = await async_client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0.7,
                max_tokens=512,
            )
            print("Model: ", model)
            break
        except together.error.RateLimitError as e:
            print(e)
            await asyncio.sleep(sleep_time)
    # Without this check, exhausting every retry would fail with an unbound `response`.
    if response is None:
        raise RuntimeError(f"Rate limit retries exhausted for {model}")
    return response.choices[0].message.content

### Constructing the Multi-Layer Proposal MoA

Now we can construct our MoA system!

> NOTE: You'll notice that we don't keep a running list of responses, so Layer #3 might be influenced by Layer #1 through the output of Layer #2 - but Layer #3 doesn't directly see the outputs of Layer #1.

In [None]:
async def multi_layer_moa(user_prompt, layers):
    """Run the main loop of the MoA process."""
    # Layer #1: In order to collect the first set of results - we run Layer #1 separately.
    results = await asyncio.gather(*[run_llm(model, user_prompt) for model in proposal_models])

    # Layers #2 through #N-1: We run the remaining proposal layers in a loop, collecting the results for each pass and including them in the next input.
    for _ in range(1, layers - 1):
        results = await asyncio.gather(
            *[run_llm(model, user_prompt, prev_response=results) for model in proposal_models]
        )

    # Layer #N: Use the collected results as input, provide the aggregator system prompt, and call the aggregator model.
    final_stream = client.chat.completions.create(
        model=aggregator_model,
        messages=[
            {
                "role": "system",
                "content": collect_responses(aggreagator_system_prompt, results),
            },
            {"role": "user", "content": user_prompt},
        ],
        stream=True,
    )

    # Stream the output
    for chunk in final_stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

In [None]:
asyncio.run(multi_layer_moa(
    user_prompt="What is the meaning of life?",
    layers=3
))

Error code: 429 - {"message": "You have been rate limited. Your rate limit is 60 queries per minute. Please navigate to https://api.together.xyz/settings/billing to upgrade to a paid plan.", "type_": "credit_limit"}
Error code: 429 - {"message": "You have been rate limited. Your rate limit is 60 queries per minute. Please navigate to https://api.together.xyz/settings/billing to upgrade to a paid plan.", "type_": "credit_limit"}
Error code: 429 - {"message": "You have been rate limited. Your rate limit is 60 queries per minute. Please navigate to https://api.together.xyz/settings/billing to upgrade to a paid plan.", "type_": "credit_limit"}
Error code: 429 - {"message": "You have been rate limited. Your rate limit is 60 queries per minute. Please navigate to https://api.together.xyz/settings/billing to upgrade to a paid plan.", "type_": "credit_limit"}
Model:  meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
Model:  google/gemma-2-27b-it
Model:  zero-one-ai/Yi-34B-Chat
Model:  deepseek-ai/de