# Task
Your task is to implement a multi-agent collaboration framework that uses RAG as a tool to answer a set of hard mathematical questions.

You will build your pipeline with a variety of tools from open-source libraries and get hands-on experience with accelerating state-of-the-art models through quantization for inference on a free-tier commercial GPU.

---------
### Please note that you need to use a GPU instance to solve the exercise
---------


# Grading
The subtasks as well as their respective points are given below:
- Task 1 - Data preparation (2pt)
- Task 2 - RAG preparation (3pt) / Custom Retriever (Extra 1pt)
- Task 3 - ZS and RAG experiments (3pt)
- Task 4 - Multi-Agent experiments (4pt)
- Task 5 - Tutor Tool experiments (2pt)




### Install necessary libraries

In [None]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python -q
!pip install llama-index -q
!pip install numba -q
!pip install llama-index-retrievers-bm25 -q
!pip install datasets -q
!pip install llama-index-vector-stores-postgres -q
!pip install llama-index-embeddings-huggingface -q
!pip install llama-index-llms-llama-cpp -q
!pip install langchain-community -q

### Imports

In [None]:
import re
from collections import Counter
import pandas as pd
from datasets import load_dataset
import asyncio
import nest_asyncio
nest_asyncio.apply()
from typing import List
from pathlib import Path
import llama_index
from llama_index.readers.file import CSVReader
from llama_index.llms.llama_cpp import LlamaCPP as IndexWrapperLlama
from langchain_community.llms import LlamaCpp as ChatWrapperLlama
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import PromptTemplate
from tqdm.asyncio import tqdm
from llama_index.core.schema import NodeWithScore
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore
from llama_index.core.query_engine import RetrieverQueryEngine

### Task 1: Data & Model preparation
##### You will work with the infamous [MATH dataset](https://github.com/hendrycks/math?tab=readme-ov-file#measuring-mathematical-problem-solving-with-the-math-dataset)

It consists of two splits (train/test) and a total of 12500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.


In [None]:
dataset = load_dataset("lighteval/MATH") ### Use this link to download the dataset in Huggingface format.

#### In the cells below you need to:
- Keep only rows that correspond to Level 1 difficulty (level column) and where the total length of the problem and the solution is no longer than 4028 characters. Keep only the problem and solution columns afterwards.
- Preprocess the data by removing everything between [asy] and [/asy] (including the tags) that might exist in the answers.
- In the train set, merge the columns under a single one named ```problem_with_solution``` where the format is:
```
    Problem:
    {content of problem column}

    Solution:
    {content of solution column}
```
- Save both dataframes as math_dataset_{```split```}.csv
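The cleaning and merging steps above can be illustrated on plain strings. This is only a minimal sketch: the helper names `strip_asy` and `merge_problem_solution` are hypothetical and not part of the notebook, and the sample text is made up. Applying the same idea to the dataset rows is left to the cells below.

```python
import re

# Non-greedy match across newlines, so each [asy]...[/asy] block is removed on its own.
ASY_BLOCK = re.compile(r"\[asy\].*?\[/asy\]", flags=re.DOTALL)

def strip_asy(text: str) -> str:
    """Remove every [asy]...[/asy] block, including the tags themselves."""
    return ASY_BLOCK.sub("", text)

def merge_problem_solution(problem: str, solution: str) -> str:
    """Build the 'Problem: ... Solution: ...' string described above."""
    return f"Problem:\n{problem}\n\nSolution:\n{solution}"

# Made-up sample, only for illustration.
sample = "Find the area.[asy]\ndraw((0,0)--(1,1));\n[/asy] The side is 2."
cleaned = strip_asy(sample)
print(cleaned)
print(merge_problem_solution(cleaned, "The area is $2^2 = \\boxed{4}$."))
```

The `re.DOTALL` flag matters because Asymptote blocks usually span several lines, and the non-greedy `.*?` keeps two separate blocks from being merged into one deletion.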

In [None]:
# Access the training data
train_data =

# Access the test data
test_data =

def clean_problem(example):
    """ Removes text between [asy] and [/asy] including the tags in the 'problem' column. """
    return

# Apply the cleaning function to every row of the train split
train_data = train_data.map(clean_problem)
# Apply the same cleaning function to every row of the test split
test_data = test_data.map(clean_problem)


# Save them as CSVs #
df = pd.DataFrame(train_data)
df['problem_with_solution'] =
df.to_csv("math_dataset_train.csv", index=False)

df = pd.DataFrame(test_data)
df.to_csv("math_dataset_test.csv", index=False)

#### In the following section you will start building the RAG pipeline using the LlamaIndex library, Hugging Face embeddings and the LlamaCPP inference acceleration framework.

#### **Why Retrieval Augmented Generation?**
In this exercise, you will leverage in-context learning through a RAG approach to improve your LLM's handling of complex mathematical problems. For each problem in the test set, RAG retrieves relevant question-answer pairs from the training set and supplies them to the model as in-context examples. This capitalizes on few-shot learning: the LLM adapts to each problem without retraining, using the examples as direct references.

This approach is especially beneficial in mathematics, where different problems require specific methods. With RAG, the model receives pertinent, problem-specific examples for each new query, which improves its ability to solve diverse mathematical problems. The setup shows how combining RAG with few-shot principles can boost performance by providing focused, relevant examples at inference time.

#### **To Implement the RAG pipeline fill the cells below to load 2 models:**
- The Embedding model that will be used to:
 -  Convert the problem-answer pairs of our training set into a vector database.

 -  Match each incoming test problem with a number of retrieved items from the database.

- The Chat model that given a query will:

    - Reformulate and propose different versions of the query.

    - Given the retrieved results from the vector database and the current problem, it will try to solve the task at hand.

In [None]:
### Embedding model setup ###
### Hint: You can choose any library / model you want from API endpoints (OpenAI, Cohere ...), Sentence Transformers or HuggingFaceEmbeddings (preferred)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

##### For the model we will use the [LlamaCPP library](https://llama-cpp-python.readthedocs.io/en/stable/) which offers lightning-fast inference on a commercial GPU.
---------------------------------------------------------------------
##### The suggested model to use will be the Quantized Version of the latest Mistral-7B Instruct model. Feel free to use any other model you want but be careful regarding GPU Memory requirements!


In [None]:
!wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

In [None]:
llm = IndexWrapperLlama(
    model_url=None,
    model_path='mistral-7b-instruct-v0.2.Q4_K_M.gguf',
    temperature= #???,
    max_new_tokens= #???,
    context_window=32000, ### Make sure it fits the GPU you use. Depending on your context and the K value at RAG (see below) you will need 10~12k. If possible use all 32k.
    model_kwargs={"n_gpu_layers": 32}, ### Make sure it fits the GPU you use. The model has 32 layers so technically all of them should fit in a free tier T4 GPU.
)
llm.verbose=False

#### **Provide a short answer here on why you chose these values of temperature / max_new_tokens / context_window.**
----------------------------------------------------------------------------
Answer:

#### Let's now create a VectorStoreIndex using the joint problem-solution column of the training dataset.
#### Use the training set csv to populate your DB.
*(Hint: Use VectorStoreIndex class of LlamaIndex)*

In [None]:
loader = CSVReader()
documents = loader.load_data(file=Path('./math_dataset_train.csv'))

splitter =
index =

#### Let's talk about Query Expansion, and why it is useful in complex RAG scenarios.
-----------------------------------------------------------------------------
Query expansion is a technique used to enhance the scope of a search query by generating additional related queries. This approach enriches the query process, increasing the likelihood of retrieving more comprehensive and relevant information. By generating variations of a math query, the system can pull from a wider array of similar past problems, leading to more robust and reliable answers.

#### **To Implement Query Expansion fill the cells below to:**
- Prompt your Chat Model with a query (test math problem) and return N different reformulations of the query.
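One possible way to turn the raw model output into a clean list of queries is sketched below. It is an illustration under stated assumptions rather than the expected solution: `expand_query`, `LIST_MARKER` and `fake_llm` are hypothetical names, and the LLM is replaced by a plain callable that maps a prompt string to a completion string. With the notebook's `llm`, something like `lambda p: llm.complete(p).text` could play that role.

```python
import re
from typing import Callable, List

# Leading list markers such as "1.", "2)" or "-" that chat models often add.
LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")

def expand_query(complete: Callable[[str], str], query: str, num_queries: int = 4) -> List[str]:
    """Return the original query followed by up to num_queries - 1 generated ones."""
    prompt = (
        f"Generate {num_queries - 1} search queries, one on each line, "
        f"related to the following input query:\nQuery: {query}\nQueries:\n"
    )
    generated: List[str] = []
    for line in complete(prompt).splitlines():
        candidate = LIST_MARKER.sub("", line).strip()
        # Skip blank lines, echoes of the original query and duplicates.
        if candidate and candidate != query and candidate not in generated:
            generated.append(candidate)
    return [query] + generated[: num_queries - 1]

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned completion."""
    return (
        "1. Solve 2^{x-3} = 3^{x-2} for x\n"
        "2) Exponential equation with different bases\n"
        "\n"
        "- Taking logarithms of both sides\n"
    )

expanded = expand_query(fake_llm, "What is the solution of 2^{x-3} = 3^{x-2}?", num_queries=3)
for q in expanded:
    print(q)
```

The marker pattern requires a dot, parenthesis or dash after the leading token, so a math query that starts with a digit (for example `2^{x} = 8`) is left untouched.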

In [None]:
query_gen_prompt_str = (
    "You are a helpful assistant that generates multiple search queries based on a "
    "single input query. Generate {num_queries} search queries, one on each line, "
    "related to the following input query:\n"
    "Query: {query}\n"
    "Queries:\n"
)
query_gen_prompt = PromptTemplate(query_gen_prompt_str)

def generate_queries(llm, query_str: str, num_queries: int = 4):
    """
        Fill in the code to return a list of the original query (query_str) followed by (num_queries - 1) queries generated by the llm.
    """
    return

In [None]:
### Test: It should return the original query plus 9 new queries ###
query_str = "What is the solution of 2^{x-3} = 3^{x-2}?"
queries = generate_queries(llm, query_str, num_queries=10)
for f in queries:
    print(f)

#### Implementing the Fusion Retriever
##### The BM25 retriever:
--------------------------------------------------------------------------
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters.
Given a query $Q$, containing keywords $q_1, ..., q_n$, the BM25 score of a document $D$ is:

$$
\text{score}(D,Q) = \sum _{i=1}^{n} \text{IDF}(q_{i}) \cdot \frac{f(q_{i},D) \cdot (k_{1}+1)}{f(q_{i},D) + k_{1} \cdot (1-b + b \cdot \frac{|D|}{\text{avgdl}})}
$$

where $f(q_{i}, D)$ is the number of times that the keyword $q_i$ occurs in the document $D$, $|D|$ is the length of the document $D$ in words, and avgdl is the average document length in the text collection from which documents are drawn. $k_1$ and $b$ are free parameters, usually chosen, in the absence of an advanced optimization, as $k_1 \in [1.2, 2.0]$ and $b = 0.75$.
IDF($q_i$) is the inverse document frequency weight of the query term $q_i$.
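The formula above can be traced term by term in a few lines of plain Python. This is a minimal sketch, not the LlamaIndex implementation: the function name `bm25_scores` is hypothetical, tokenization is a simple whitespace split, and the IDF uses one commonly used smoothed variant, $\ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$, which is an assumption since the text does not fix a specific IDF definition.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency n(q): number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            f = tf[q]  # f(q_i, D) in the formula
            if f == 0:
                continue
            # Assumed smoothed IDF variant; it stays positive for every term.
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

corpus = [
    "solve the quadratic equation".split(),
    "area of a triangle".split(),
    "solve the linear equation for x".split(),
]
scores = bm25_scores("quadratic equation".split(), corpus)
print(scores)
```

The first document matches both query terms and is the shortest match, so it ranks first; the second document shares no term with the query and scores zero.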


#### **To Implement Fusion Retriever fill the cells below to:**
- Load the BM25 retriever from llama_index retrievers package
- Initialize a Vector retriever from your created VectorStoreIndex class
- Fill in the run_queries function below that will run a set of queries against your combined retrievers and return the top_k results.

#### **For an extra 1 point:**
- Extend the Retriever class of LlamaIndex and create a custom BM25 retriever with $k_1=1.5$ and $b = 0.7$. It does not have to be optimized, although a heap would be beneficial for performance.
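For the heap mentioned in the extra task, the standard library already offers a bounded top-k selection. The sketch below assumes the BM25 scores are available as a plain list indexed by document position; `top_k_by_score` is a hypothetical helper name.

```python
import heapq
from typing import List, Tuple

def top_k_by_score(scores: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

best = top_k_by_score([0.2, 1.7, 0.0, 0.9], k=2)
print(best)
```

`heapq.nlargest` keeps only k candidates in memory while scanning, which avoids sorting the whole corpus when k is small.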


In [None]:
# Use common TOP_K in both retrievers
TOP_K = 3
NUMBER_OF_RETRIEVERS = 2
## Vector retriever
vector_retriever =

## BM25 retriever
bm25_retriever =

async def run_queries(queries, retrievers):
    """Run every query against every retriever concurrently."""
    tasks = []
    for query in queries:
        for retriever in retrievers:
            tasks.append(retriever.aretrieve(query))

    # Results come back in task order, i.e. grouped by query, then by retriever
    task_results = await tqdm.gather(*tasks)

    # Repeat each query once per retriever so it lines up with its result list
    repeated_queries = [query for query in queries for _ in retrievers]
    results_dict = {}
    for i, (query, query_result) in enumerate(zip(repeated_queries, task_results)):
        results_dict[(query, i)] = query_result

    return results_dict

#### Important Note:

--------------------------------------------
Given N queries and M retrievers, each returning TOP_K results, the ```run_queries``` function can theoretically return up to $N \times M \times TOP_K$ results. In practice, many results are shared between the retrievers, so the total number of unique retrieved items can be smaller.
Expect the smallest number of items to be $ N \times TOP_K$, in the case where both retrievers retrieve exactly the same things.

#### The ```results_dict``` has $N \times M$ entries, one per (query, retriever) pair. Each entry is a ```List``` of LlamaIndex ```NodeWithScore``` objects with length up to TOP_K.
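To see how the gathered results line up with the queries and retrievers, the toy below replaces the real retrievers with stubs. `FakeRetriever` and `gather_results` are hypothetical names, and plain `asyncio.gather` is used in place of the progress-bar variant. Inside a notebook, `asyncio.run` relies on `nest_asyncio.apply()` having been called, as in the imports cell.

```python
import asyncio
from typing import Dict, List, Tuple

class FakeRetriever:
    """Stub with the same async method name as a LlamaIndex retriever."""
    def __init__(self, name: str) -> None:
        self.name = name

    async def aretrieve(self, query: str) -> List[str]:
        return [f"{self.name}:{query}"]

async def gather_results(queries: List[str], retrievers: List[FakeRetriever]) -> Dict[Tuple[str, int], List[str]]:
    # One (query, retriever) pair per task, in a fixed order.
    pairs = [(q, r) for q in queries for r in retrievers]
    # gather returns results in the same order as the awaitables it was given.
    results = await asyncio.gather(*(r.aretrieve(q) for q, r in pairs))
    return {(q, i): res for i, ((q, _), res) in enumerate(zip(pairs, results))}

toy_results = asyncio.run(
    gather_results(["q1", "q2"], [FakeRetriever("vector"), FakeRetriever("bm25")])
)
for key, value in toy_results.items():
    print(key, value)
```

With two queries and two stub retrievers the dictionary holds four keys, and the integer in each key is the position of the task, not the index of the retriever.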
---------------------------------------------------

In [None]:
### Test ###
query_str = "What is the solution of 2^{x-3} = 3^{x-2}?"
queries = generate_queries(llm, query_str, num_queries=3)
results_dict = await run_queries(queries, [vector_retriever, bm25_retriever])
for f in results_dict:
    print(f)

#### **Retrieve and combine.**
Fill the cell below so that your function collects the results from the combined retrieval - as Nodes in LlamaIndex - and sorts them according to their score (node.score), then it returns the top_k results based on that score.
To get the actual text of a node use node.get_content()
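The rank-based fusion described in the docstring of the next cell can be tried out on plain `(text, score)` tuples before touching LlamaIndex nodes. This is a simplified sketch under that assumption: `reciprocal_rank_fuse` is a hypothetical name, and the toy scores are invented to mimic a cosine-similarity retriever and a BM25 retriever.

```python
from typing import Dict, List, Tuple

def reciprocal_rank_fuse(
    results: Dict[tuple, List[Tuple[str, float]]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists by summing 1 / (rank + 1) per unique text."""
    fused: Dict[str, float] = {}
    for hits in results.values():
        # Rank inside each list by that retriever's own score, best first.
        ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
        for rank, (text, _score) in enumerate(ranked):
            fused[text] = fused.get(text, 0.0) + 1.0 / (rank + 1)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_k]

toy = {
    ("q", 0): [("doc A", 0.9), ("doc B", 0.8)],   # cosine-like scores
    ("q", 1): [("doc B", 12.0), ("doc C", 7.5)],  # BM25-like scores
}
fused_top = reciprocal_rank_fuse(toy)
print(fused_top)
```

Only the rank of an item within each list enters the fused score, so the different scales of the two retrievers do not matter. Here "doc B" wins because it appears in both lists.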

In [None]:
def fuse_results(results_dict, similarity_top_k: int = 2):
    """
    Create two dictionaries: fused_scores to store the cumulative scores for each unique text content, and text_to_node to map each text to its respective node and score.
    Loop through a dictionary results_dict that contains lists of node objects with associated scores.
        For each list of nodes:
            Sort the nodes in descending order based on their scores.
            Extract the text content of each node using a method like node.get_content().
            For each text content, check if it's already in fused_scores. If not, initialize its score to 0.0.
        Update the score of this text in fused_scores by adding the reciprocal of its rank (i.e., 1 divided by the position in the sorted list plus 1).
    Sorting and Re-ranking:
        Sort the fused_scores dictionary by value in descending order to prioritize texts with higher aggregated scores.
    Adjusting Node Scores:
        Based on the sorted scores, create a list ranked_nodes to store nodes with their updated scores.
        Populate this list by mapping each text back to its original node, updating the node’s score to the newly calculated aggregated score.
        Ensure only the top entries defined by similarity_top_k (an optional parameter with a default of 2) are returned.
    Function Return:
        The function should return the list of top nodes based on the sorted updated scores.
    """
    fused_scores = {}
    text_to_node = {}

    # Compute scores
    for nodes_with_scores in results_dict.values():
        for rank, node_with_score in enumerate(
            sorted(
                nodes_with_scores, key=lambda x: x.score or 0.0, reverse=True
            )
        ):

    # Sort results
    ranked_results =

    # Adjust node scores
    ranked_nodes: List[NodeWithScore] = []
    for text, score in ranked_results.items():
        # Fill this
    return ranked_nodes[:similarity_top_k]

In [None]:
### Test it ###
query_str = "What is the solution of 2^{x-3} = 3^{x-2}?"
queries = generate_queries(llm, query_str, num_queries=4)
results_dict = await run_queries(queries, [vector_retriever, bm25_retriever])
final_results = fuse_results(results_dict)
for n in final_results:
    print(f"Score: {n.score}", "\n", n.text, "\n---\n")

#### **From Retrieval to Math problem solving:**
Take a look at the FusionRetriever Class below:

- It overrides the BaseRetriever class, by altering the functionality of the _retrieve function to implement query expansion given a provided argument.

- Then it calls the fuse_results function to score the examples keeping the top_k ones, packing the previous steps in a single Retriever Class

In [None]:
class FusionRetriever(BaseRetriever):
    """Ensemble retriever with fusion."""
    def __init__(
        self,
        llmcpp,
        retrievers: List[BaseRetriever],
        similarity_top_k: int = 1,
        n_query_expansion: int = 1,
        enable_query_expansion: bool = False,

    ) -> None:
        """Init params."""
        self._retrievers = retrievers
        self._similarity_top_k = similarity_top_k
        self._llmcpp = llmcpp
        self.n_query_expansion = n_query_expansion
        self.enable_query_expansion = enable_query_expansion
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve the results. REMEMBER: to run an async function use asyncio.run(###THE ASYNC FUNCTION###)"""
        return final_results

##### Test the following function and answer the following questions:
-  What do you observe with / without the query expansion flag given the query below?
-  The retrieved text is simply added before our problem string. Given that our model is an Instruction Tuned model, is this correct?

------------------------------------------------------------------------------
Answer:


In [None]:
query_str = "\nProblem:\nWhat is the solution of 2^{x-3} = 3^{x-2}?\nSolution:\n"
fusion_retriever = FusionRetriever(
   llm, [vector_retriever, bm25_retriever], similarity_top_k=3
)
query_engine = RetrieverQueryEngine.from_args(fusion_retriever, llm=llm)
response = query_engine.query(query_str)
print(response.response)

##### Now, disable query expansion and set top_k as high as your GPU and context size allow (3 is a good value).

In order to properly format the retrieved problem-solution pairs into useful in-context examples we need to prompt the Chat Model (through the RetrieverQueryEngine call).

This can be done with the text_qa_template argument, which receives a PromptTemplate object. This object wraps a string that can be formatted in 2 positions: ```{context_str}``` and ```{query_str}```.

For example:
```
text_qa_template_str = (
    "This is an example.\n"
    "The retrieved items will be put here {context_str}.\n"
    "While the current problem will be put here {query_str}.\n"
)
text_qa_template = PromptTemplate(text_qa_template_str)

```

Your goal is to find an appropriate way to prompt your Chat Model so that it can properly utilize its context examples.
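To get a feel for how the two placeholders are filled, the sketch below formats a template with plain `str.format`, without any LlamaIndex objects. The wording of `EXAMPLE_TEMPLATE` is only an assumed starting point, not a recommended prompt, and `build_prompt` is a hypothetical helper.

```python
from typing import List

# Assumed wording; the actual instruction text is yours to design.
EXAMPLE_TEMPLATE = (
    "You are a careful math tutor. Below are solved examples.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Solve the next problem in the same style and state the final number at the end.\n"
    "{query_str}"
)

def build_prompt(examples: List[str], problem: str) -> str:
    """Join the retrieved examples and fill both placeholders."""
    context = "\n\n".join(examples)
    return EXAMPLE_TEMPLATE.format(context_str=context, query_str=problem)

prompt = build_prompt(
    ["Problem:\nWhat is 1 + 1?\n\nSolution:\n$1 + 1 = \\boxed{2}$"],
    "\nProblem:\nWhat is 2 + 3?\nSolution:\n",
)
print(prompt)
```

Printing the filled prompt before running the query engine is a cheap way to check that the examples and the current problem end up where the model is told to look for them.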


In [None]:
### Test it ###
query_str = "\nProblem:\nWhat is the solution of 2^{x-3} = 3^{x-2}?\nSolution:\n"
fusion_retriever = FusionRetriever(
   llm, [vector_retriever, bm25_retriever], similarity_top_k=3
)
text_qa_template_str = ('FILL THIS')
text_qa_template = PromptTemplate(text_qa_template_str)

query_engine = RetrieverQueryEngine.from_args(fusion_retriever, llm=llm, text_qa_template=text_qa_template)
response = query_engine.query(query_str)
print(response.response)

#### Iterate over the first **35** questions of the Test dataset and calculate the respective performance.

#### **Perform the following check: Zero-Shot + Inception Prompting.**
Prepend the problem with an inception prompt (You are a...), which has been found to be beneficial for performance in reasoning and math tasks.

You will encounter the following 2 problems:
- Your model will probably answer with a long analytical solution containing multiple steps and numbers. Fill in extract_last_floating_number so that it returns the last real (positive or negative) number in a string, and apply it to your solutions.
- The MATH ground truth answers are located at the end of the solution string, enclosed in a \boxed{} LaTeX command. Fill in extract_answer_from_boxed so that it returns the ground truth answer.
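Both extraction steps can be prototyped on plain strings. The sketch below is one possible approach under stated assumptions, not the reference solution: `last_number_in` and `boxed_content` are hypothetical names chosen so they do not clash with the functions you fill in below, thousands separators are not handled, and a minus sign directly attached to a digit (as in `2-3`) is read as a negative number.

```python
import re
from typing import Optional

# Optional sign, then a decimal, an integer, or a leading-dot decimal.
NUMBER = re.compile(r"[-+]?(?:\d+\.\d+|\d+|\.\d+)")

def last_number_in(text: str) -> Optional[float]:
    """Return the last number found in the text, or None if there is none."""
    matches = NUMBER.findall(text)
    return float(matches[-1]) if matches else None

def boxed_content(expression: str) -> Optional[str]:
    """Return what is inside the last \\boxed{...}, honouring nested braces."""
    marker = "\\boxed{"
    start = expression.rfind(marker)
    if start == -1:
        return None
    depth = 1
    collected = []
    for ch in expression[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(collected)
        collected.append(ch)
    return None  # braces never balanced

print(last_number_in("First 3 apples, then -4.5 degrees."))
print(boxed_content(r"so the answer is $\boxed{\frac{1}{2}}$."))
```

Counting braces instead of using a single regular expression keeps answers such as `\frac{1}{2}` intact, since a naive pattern would stop at the first closing brace.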

In [None]:
def extract_last_floating_number(text):
    """
    Extracts the last real number from a string.
    """

def extract_answer_from_boxed(expression):
    r"""
    Extracts the content within the \boxed{} LaTeX command.
    """

In [None]:
### Iterate over 35 items of test set ###

#### Iterate over the first **35** questions of the Test dataset and calculate the respective performance.

#### **Perform the following check: FusionRAG + K-way.**
Now test the FusionRAG performance by performing queries to your RetrieverQueryEngine made in the previous step.

You need to fill the following infer_at_k function to perform the following things:
- Accept as an argument the incoming math problem and a parameter k.
- Retrieve the relevant in context examples from the FusionRetriever.
- Format the retrieved in-context examples and pass them to the RetrieverQueryEngine.
- Generate k responses from the RetrieverQueryEngine.
---------------------------------------------------------------------
- **Very Important:** Make sure you apply some post-processing to your responses.
You will most probably need to truncate them if you see that the Chat Model, after answering your question, makes up another problem and tries to solve it as well.
For example you might look for the keyword ```"Problem:"``` in the answer and keep everything before it.
Same can be said for patterns like  multiple empty new lines ```'\n\n\n'``` or ```'-----'```.
- Your function should return two lists, in this order: the answers as numbers (or None if no number could be extracted) and the strings that represent the truncated responses. Each list should have length equal to k.

(Hint: Use the ```extract_last_floating_number``` to perform this)
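The truncation step can be sketched as cutting the response at the earliest stop marker. The marker list mirrors the examples given above; `truncate_response` and `STOP_PATTERNS` are hypothetical names, and the exact patterns are an assumption you may need to adapt to what your model actually produces.

```python
import re
from typing import Sequence

# "Problem:" keyword, three or more consecutive newlines, or a dashed rule.
STOP_PATTERNS = (r"Problem:", r"\n\s*\n\s*\n", r"-{5,}")

def truncate_response(text: str, patterns: Sequence[str] = STOP_PATTERNS) -> str:
    """Keep only the text before the earliest stop marker."""
    cut = len(text)
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            cut = min(cut, match.start())
    return text[:cut].strip()

raw = "x = 5, so the answer is 5.\n\nProblem:\nWhat is 2 + 2?\nSolution: 4"
print(truncate_response(raw))
```

Truncating before number extraction matters here: without it, the last number in the string would come from the made-up follow-up problem rather than from the actual answer.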


In [None]:
def most_common_item(items):
    if not items:
        return 0
    counter = Counter(items)
    most_common, _ = counter.most_common(1)[0] if counter else (0, 0)
    return most_common

In [None]:
def infer_at_k(problem, k=3):
    fusion_retriever =
    text_qa_template_str = ("FILL THIS")
    text_qa_template = PromptTemplate(text_qa_template_str)
    query_engine =
    responses = []
    full_text_responses = []
    return responses, full_text_responses

#### Iterate over the 35 test examples below ###
-----------------------------------------------------------------------------
Hint: It is quite useful - although not necessary - if you also save the generated responses and the generated numbers for each of the 35 questions into two pickle files named ```rag_test_responses.pt``` and ```rag_test_numbers.pt```
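Saving and reloading the results takes only a few lines with the standard `pickle` module. In this sketch `save_pickle` and `load_pickle` are hypothetical helpers, the stored list is dummy data, and a temporary directory is used so that nothing is written next to the notebook. In the notebook you would pass `"rag_test_numbers.pt"` or `"rag_test_responses.pt"` as the path instead.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any

def save_pickle(obj: Any, path: str) -> None:
    """Serialize obj to path in binary mode."""
    with Path(path).open("wb") as fh:
        pickle.dump(obj, fh)

def load_pickle(path: str) -> Any:
    """Load an object previously written by save_pickle."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)

# Dummy data: one question, k=3 extracted numbers, one of them missing.
numbers_per_question = [[4.0, 4.0, None]]

with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "rag_test_numbers.pt")
    save_pickle(numbers_per_question, target)
    restored = load_pickle(target)

print(restored)
```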

#### Answer the following questions:
- What would be an appropriate temperature value for trying multiple times to come up with an answer for the same question?
- There are questions where the model answers correctly in all of the tries, and others where the model might answer correctly only once among its different tries. What does this phenomenon tell us about the math capabilities of the tested model?
----------------------------------------------------------------------------
Answer:

### Multi-Agent Setup with CrewAI

From a single LLM to multiple agents: You might have heard about the different ways in which the concept of an agent has been incorporated into LLM communities. Delegating tasks to specialised, personalised agents can provide significant boosts over a single LLM.

Here you will use the [CrewAI library](https://docs.crewai.com/) to build a multi-agent system that incorporates meta-cognition and error checking.

-----------------------------------------------------------------------------
Take a look at the documentation and examples of how to initialize an [Agent](https://docs.crewai.com/core-concepts/Agents/), a [Task](https://docs.crewai.com/core-concepts/Tasks/), a [Tool](https://docs.crewai.com/core-concepts/Tools/#creating-your-own-tools) and a [Crew](https://docs.crewai.com/core-concepts/Crews/).

These are the only components you need!

In [None]:
### Import Crew AI ###
!pip install 'crewai[tools]' -q
!pip install cohere -q
!pip install anthropic -q
!pip install -U langchain-anthropic
!pip install --upgrade langchain_experimental -q

import os
import numpy as np
from crewai import Crew, Process, Agent, Task
from langchain_community.chat_models import ChatCohere
from crewai_tools import BaseTool

#### Our Goal now is to create the following multi-agent setup:

RAG --> Chat Model Agent --> K-Answers --> Solution Analyzer --> Feedback --> Summary Writer --> Final Answer

Here:
* Solution Analyzer: Agent whose purpose is to look at the current problem and different ways to solve it, and choose the correct one out of them.
If there are no suggested ways to solve the problem, or the agent thinks that none of them is correct, the agent can suggest its own solution.

* Feedback: A string reflecting the response of the Solution Analyzer. It can be a sequence of solution steps, a string saying "All the steps are correct" or anything else the Agent might respond.

* Summary Writer: Agent whose purpose is to look through the Feedback and decide what is the correct number value to be extracted from it as a solution. It will emulate the behaviour of the ```extract_last_floating_number``` function you created above but in a more context-aware manner.

* Final Answer: A string / float representing the best out of k possible ways of answering the math question according to the agent pipeline.
---------------------------------------------------------------------------
**Each Agent has access only to the output of the previous Agent, and the inputs of the current Task.**

---------------------------------------------------------------------------
##### **Important:** Since we have already performed the steps up to K-Answers (and possibly saved the results), there is no reason to redo them. If that's the way you want to proceed, look at the ```retrieve_at_k``` function below and modify it so that it does not make calls to the Chat Model but reads the saved pickles instead.


##### If you prefer to use the retrieve_at_k function as is, that is no problem at all; you will just have to wait a bit longer while evaluating the results.

-------------------------------------------------------------------------------
#### Now let's have a look at the following modified FusionRAG K-Way function:
It performs the same steps as before, but now it drops the responses whose extracted number is None (along with their analytical solutions) and returns a context describing our model's various attempts at solving the problem. If the Chat Model proposed no solution, it asks for help.


In [None]:
def retrieve_at_k(argument, k=3):
    message_of_solution = ''
    message_of_no_solution = 'Unfortunately I have no idea how to solve this problem. Can you help me?'
    ### Filter out non-None responses ###
    numerical_responses, full_answers = infer_at_k(argument, k=k)

    final_responses =  []
    for n,f in zip(numerical_responses, full_answers):
        if n is None:
            pass
        else:
            final_responses.append(f)

    if len(final_responses) == 0:
        return message_of_no_solution
    else:
        for i in range(len(final_responses)):
            message_of_solution += f'\nSolution {i}:\n{final_responses[i]}\n'
    return message_of_solution

In [None]:
### Test it with a random problem ###


#### Create a MathTasks class that implements, as methods, a set of custom tasks to be run by our agents.

Tasks:
* validation: The task of the Solution Analyzer. It needs to have access to the agent, the current task and the proposed solutions.

* summary: The task of the Summary Writer. It needs access only to the current agent.


In [None]:
from crewai import Task
from textwrap import dedent

class MathTasks():
  def validation(self, ### Fill the arguments
  ):
    return Task(description=dedent(f"""
        Fill this
      """),
      agent=agent,
      expected_output='Fill this'
    )

  def summary(self,  ### Fill the arguments
              ):
    return Task(description=dedent(f"""
        Fill this
      """),
      agent=agent,
      expected_output='Fill this'
    )

#### Make up your crew using **ONE** of the following:
* OpenAI (I recommend GPT3.5 Turbo)
* Cohere (I recommend command-r-plus)
* Anthropic (I recommend any model)

In [None]:
from langchain_community.chat_models import ChatCohere
from langchain_community.chat_models import ChatOpenAI
from langchain_anthropic import ChatAnthropic

#os.environ["OPENAI_API_KEY"] =
#os.environ["COHERE_API_KEY"] =
#os.environ["ANTHROPIC_API_KEY"] =

#agent_base_llm = ChatCohere(model='command-r-plus', temperature=0.2)
#agent_base_llm = ChatOpenAI(temperature=0.2, model_name="gpt-3.5-turbo-1106")
#agent_base_llm = ChatAnthropic(temperature=0.2, model_name="claude-3-haiku-20240307")

#### **OR** Keep the same model but with a different wrapper using LangChain experimental.

-------------------------------------------------------------------------------
**Important:** If you choose this option, you will need to release the GPU memory held by the previously loaded model. You can do this by selecting Runtime-->Restart Session, re-running every import, and loading this model instead of the previous one.
------------------------------------------------------------------------------

In [None]:
from langchain_experimental.chat_models import Llama2Chat

llm = ChatWrapperLlama(
    model_url=None,
    model_path='mistral-7b-instruct-v0.2.Q4_K_M.gguf',
    temperature=###,
    max_tokens=###, (This is max_new_tokens but with a different argument name in this wrapper)
    n_ctx=###, (This is the context length but with a different argument name in this wrapper)
    n_gpu_layers= 32,
    top_p=0.95
)
llm.verbose=False

agent_base_llm = Llama2Chat(llm=llm)

In [None]:
MAX_RPM_GLOBAL = 100
N_AGENTS = 2

solution_analyzer = Agent(
  role=
  goal=
  backstory=
  llm=agent_base_llm,
  max_rpm=MAX_RPM_GLOBAL // N_AGENTS,
)

summary_writer = Agent(
  role=
  goal=
  backstory=
  llm=agent_base_llm,
  max_rpm=MAX_RPM_GLOBAL // N_AGENTS,
)

In [None]:
def pipeline(current_task, k=3):
    # Define the tasks in sequence
    proposed_solutions = ### FILL THIS and use K here
    analysis_task = MathTasks().validation(### FILL THIS)
    writing_task = MathTasks().summary(### FILL THIS)

    # Form the crew with a sequential process
    report_crew = Crew(### FILL THIS)
    # Execute tasks
    res = report_crew.kickoff()
    return res

### Measure Multi-Agent Performance




In [None]:
### Write code that iterates over the first 35 questions of the test dataset.

### Question: What is the performance difference? Did the validation agent help? Did you see any issues with the summary writer?
-------------------------------------------------------------------------------
Answer:


### Incorporate a Tutoring Mechanism.
Our pipeline will now look like this:

RAG --> Chat Model Agent --> K-Answers --> Solution Analyzer --> Feedback --> Tutor --> Ground Truth Hint --> Solution Analyzer --> Updated Feedback --> Summary Writer --> Final Answer

* Tutor: Agent that, given the current task, the provided solutions and access to the ground truth answer (from ```'./math_dataset_test.csv'```), will provide the Solution Analyzer with hints regarding the correctness of its answer. Try to make the Agent not reveal the correct answer if you can (optional).

* The Solution Analyzer will then be engaged again in a new task called reflect, where it should reflect on the tutor's hint and decide on a final answer.

------------------------------------------------------------------------------
**Important**: To implement access to the ground truth data, you need to:

Implement a class SoftTutorDB that uses the embedding model you have loaded to encode the ground truth problems in the test CSV (problems only). Given a new problem, it needs to return the top-1 solution (it is guaranteed to find a solution). Then use the provided TutorTool below and incorporate it into your pipeline.

This option is not guaranteed to work well if a non-API model has been chosen for the multi-agent pipeline, so for this task you will be graded on the basis of implementation and not actual performance.
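The lookup behind such a class boils down to nearest-neighbour search over embeddings. The toy below shows that idea in plain Python under strong assumptions: `toy_embed` is a bag-of-words stand-in for the real embedding model, `ToyTutorDB` is a hypothetical class, and the rows are made up. With the notebook's `embed_model`, a call such as `embed_model.get_text_embedding(text)` would presumably replace `toy_embed`, and the similarity would be computed on dense vectors.

```python
import math
from collections import Counter
from typing import List, Tuple

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lower-cased word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyTutorDB:
    """Keys are embedded problems, values are the matching solutions."""
    def __init__(self, rows: List[Tuple[str, str]]) -> None:
        self.keys = [toy_embed(problem) for problem, _ in rows]
        self.values = [solution for _, solution in rows]

    def get(self, query: str) -> str:
        q = toy_embed(query)
        best = max(range(len(self.keys)), key=lambda i: cosine(q, self.keys[i]))
        return self.values[best]

toy_db = ToyTutorDB([
    ("What is 2 + 2?", "The sum is $\\boxed{4}$."),
    ("What is the area of a square with side 3?", "The area is $\\boxed{9}$."),
])
hint = toy_db.get("Find the area of a square whose side is 3")
print(hint)
```

Because `max` always returns some index, the lookup never comes back empty, which matches the "guaranteed to find a solution" behaviour described above even when the best match is a poor one.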


In [None]:
class SoftTutorDB:
    def __init__(self, file='./math_dataset_test.csv', problem_index='problem', solution_index='solution'):
        self.db = pd.read_csv(file)
        self.pi = problem_index
        self.si = solution_index
        self._encode_problems()

    def _encode_problems(self):
        self.keys =
        self.values =

    def get(self, query):



class TutorTool(BaseTool):
    name: str = "Tutoring Tool"
    description: str = "Given a math problem this tool returns the correct solution to it."
    db: object = SoftTutorDB()
    def _run(self, problem: str) -> str:
        return self.db.get(problem)

#### Expand your MathTasks class to implement 2 more tasks: tutoring and reflection.

----------------------------------------------------------------------------
What items does each task need access to? Remember that each Agent has access only to the output of the previous Agent, and the input of the Task.


In [None]:
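class MathTasks():
    def validation(self):  # Fill the arguments and the body
        ...

    def summary(self):  # Fill the arguments and the body
        ...

    def tutoring(self):  # Fill the arguments and the body
        ...

    def reflect(self):  # Fill the arguments and the body
        ...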

In [None]:
analyst = Agent(
    role=
    goal=
    backstory=,
    tools = [],
    llm=agent_base_llm,
    max_rpm=1,
)

tutor = Agent(
  role=
  goal=
  backstory=,
  tools = [TutorTool()],
  llm=agent_base_llm,
  max_rpm=1,
)

writer = Agent(
    role=
    goal=
    backstory=,
    tools = [],
    llm=agent_base_llm,
    max_rpm=1,
)

In [None]:
def pipeline(current_task, k=3):
    # Define the tasks in sequence
    proposed_solutions = ### FILL THIS and use K here
    analysis_task = MathTasks().validation(### FILL THIS)
    writing_task = MathTasks().summary(### FILL THIS)

    # Form the crew with a sequential process
    report_crew = Crew(### FILL THIS)
    # Execute tasks
    res = report_crew.kickoff()
    return res

In [None]:
def pipeline_with_tutoring(current_task, k):
    # Define the tasks in sequence
    proposed_solutions = ### FILL THIS and use K here
    analysis_task = MathTasks().validation(### FILL THIS)
    tutoring_task = MathTasks().tutoring(### FILL THIS)
    reflection_task = MathTasks().reflect(### FILL THIS, Hint: Use context=[analysis_task, tutoring_task] as an extra argument)
    writing_task = MathTasks().summary(### FILL THIS)

    # Form the crew with a sequential process
    report_crew = Crew()
    # Execute tasks
    res = report_crew.kickoff()
    return res

In [None]:
### Write code that iterates over the first 35 questions of the test dataset.