# Task
Your task is to implement a multi-agent collaboration framework that uses RAG as a tool to answer a set of hard mathematical questions.

You will build your pipeline using a variety of tools from open-source libraries, and get hands-on experience with accelerating state-of-the-art models through quantization for inference on a free-tier GPU.

---------
### Please note that you need to use a GPU instance to solve the exercise
---------


# Grading
The subtasks as well as their respective points are given below:
- Task 1 - Data preparation (2pt)
- Task 2 - RAG preparation (3pt) / Custom Retriever (Extra 1pt)
- Task 3 - ZS and RAG experiments (3pt)
- Task 4 - Multi-Agent experiments (4pt)
- Task 5 - Tutor Tool experiments (2pt)




### Install necessary libraries

In [None]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python -q
!pip install llama-index -q
!pip install numba -q
!pip install llama-index-retrievers-bm25 -q
!pip install datasets -q
!pip install llama-index-vector-stores-postgres -q
!pip install llama-index-embeddings-huggingface -q
!pip install llama-index-llms-llama-cpp -q
!pip install langchain-community -q


### Imports

In [None]:
import re
from collections import Counter
import pandas as pd
from datasets import load_dataset
import asyncio
import nest_asyncio
nest_asyncio.apply()
from typing import List
from pathlib import Path
import llama_index
from llama_index.readers.file import CSVReader
from llama_index.llms.llama_cpp import LlamaCPP as IndexWrapperLlama
from langchain_community.llms import LlamaCpp as ChatWrapperLlama
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import PromptTemplate
from tqdm.asyncio import tqdm
from llama_index.core.schema import NodeWithScore
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

### Task 1: Data & Model preparation
##### You will work with the infamous [MATH dataset](https://github.com/hendrycks/math?tab=readme-ov-file#measuring-mathematical-problem-solving-with-the-math-dataset)

It consists of two splits (train/test) and a total of 12500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.


In [None]:
dataset = load_dataset("lighteval/MATH") ### Use this link to download the dataset in Huggingface format.

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



#### In the cells below you need to:
- Keep only rows that correspond to Level 1 difficulty (level column) and where the total length of the problem and the solution is no longer than 4028 characters. Keep only the problem and solution columns afterwards.
- Preprocess the data by removing everything between [asy] and [/asy] that might exist in the problems or solutions.
- In the train set, merge the columns under a single one named ```problem_with_solution``` where the format is:
```
    Problem:
    {content of problem column}

    Solution:
    {content of solution column}
```
- Save both dataframes as math_dataset_{```split```}.csv
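The cleaning and merging steps above can be sketched on a single toy record before applying them to the full dataset. This is a minimal illustration using plain dictionaries; the helper names `strip_asy` and `merge_problem_and_solution` and the toy record are hypothetical and not part of the assignment.

```python
import re

# Non-greedy match so that several [asy] blocks in one text are removed separately.
ASY_PATTERN = re.compile(r"\[asy\].*?\[/asy\]", flags=re.DOTALL)


def strip_asy(text: str) -> str:
    """Remove every [asy]...[/asy] block, including the tags themselves."""
    return ASY_PATTERN.sub("", text)


def merge_problem_and_solution(problem: str, solution: str) -> str:
    """Format one training example in the Problem/Solution layout requested above."""
    return f"Problem:\n{problem}\n\nSolution:\n{solution}"


# Hypothetical toy record, not taken from the MATH dataset.
toy = {
    "problem": "What is $1+1$? [asy]draw((0,0)--(1,1));[/asy]",
    "solution": "We have $1+1=\\boxed{2}$.",
}
merged = merge_problem_and_solution(strip_asy(toy["problem"]), strip_asy(toy["solution"]))
print(merged)
```

The `re.DOTALL` flag matters because Asymptote blocks usually span several lines.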

In [None]:
# Access the training data
train_data = dataset['train']

# Access the test data
test_data = dataset['test']

def clean_problem(example):
    """ Removes text between [asy] and [/asy] including the tags in the 'problem' column. """
    example['problem'] = re.sub(r'\[asy\].*?\[/asy\]', '', example['problem'], flags=re.DOTALL)
    example['solution'] = re.sub(r'\[asy\].*?\[/asy\]', '', example['solution'], flags=re.DOTALL)

    return example

def postprocessing(data_object):
    data_object = data_object.filter(lambda ex: ex["level"] == 'Level 1')
    data_object = data_object.filter(lambda ex: len(ex["problem"]) + len(ex["solution"]) <= 4028)
    return data_object.select_columns(['problem', 'solution'])

# Strip [asy] blocks, then filter by level and length (train split)
train_data = train_data.map(clean_problem)
train_data = postprocessing(train_data)
# Same steps for the test split
test_data = test_data.map(clean_problem)
test_data = postprocessing(test_data)


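# Save them as CSVs
df = pd.DataFrame(train_data)
# NOTE: each merged entry starts with the literal marker 'problem_with_solution'
# rather than the 'Problem:' header described above; the same marker is passed
# as the separator to SentenceSplitter when the index is built.
df['problem_with_solution'] = 'problem_with_solution' + df['problem'] + '\n\nSolution:\n' + df['solution']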
df.to_csv("math_dataset_train.csv", index=False)

problem_with_solution_df = df['problem_with_solution'].copy()
problem_with_solution_df.to_csv("math_dataset_problem_with_solution", index=False)

df = pd.DataFrame(test_data)
df.to_csv("math_dataset_test.csv", index=False)

Map:   0%|          | 0/7500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/7500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/564 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/437 [00:00<?, ? examples/s]

#### In the following section you will start building the RAG pipeline using the LlamaIndex library, Hugging Face embeddings and the LlamaCPP inference acceleration framework.

#### **Why Retrieval Augmented Generation?**
In this exercise, you will leverage in-context learning through a RAG approach to improve your LLM's handling of complex mathematical problems. For each problem in the test set, RAG retrieves relevant problem-solution pairs from the training set and supplies them to the LLM as in-context examples. This is a form of few-shot learning: the LLM adapts to each problem without retraining, using the retrieved examples as direct references.

This approach is especially useful in mathematics, where different problems require specific methods. With RAG, the model receives pertinent, problem-specific examples for each new query, which can guide it toward the right method for that problem.

#### **To Implement the RAG pipeline fill the cells below to load 2 models:**
- The Embedding model that will be used to:
 -  Convert the problem-answer pairs of our training set into a vector database.

 -  Match each incoming test problem with a number of retrieved items from the database.

- The Chat model that given a query will:

    - Reformulate and propose different versions of the query.

    - Given the retrieved results from the vector database and the current problem, it will try to solve the task at hand.
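To make the matching step concrete, the sketch below ranks stored items by cosine similarity to a query vector. The three-dimensional vectors and labels are made up for illustration; a real embedding model produces much longer vectors, and LlamaIndex handles this lookup internally.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             store: List[Tuple[str, Sequence[float]]],
             top_k: int = 2) -> List[Tuple[str, float]]:
    """Return the top_k (text, similarity) pairs, most similar first."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical toy "vector database": (text, embedding) pairs.
store = [
    ("exponent problem", [0.9, 0.1, 0.0]),
    ("geometry problem", [0.0, 0.2, 0.9]),
    ("logarithm problem", [0.7, 0.6, 0.1]),
]
hits = retrieve([1.0, 0.2, 0.0], store, top_k=2)
print([text for text, _ in hits])
```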

In [None]:
### Embedding model setup ###
### Hint: You can choose any library / model you want: API endpoints (OpenAI, Cohere ...), Sentence Transformers or HuggingFaceEmbeddings (preferred)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")


##### For the model we will use the [LlamaCPP library](https://llama-cpp-python.readthedocs.io/en/stable/), which offers lightning-fast inference speed on a commercial GPU.
---------------------------------------------------------------------
##### The suggested model to use will be the Quantized Version of the latest Mistral-7B Instruct model. Feel free to use any other model you want but be careful regarding GPU Memory requirements!


In [None]:
!wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf


In [None]:
llm = IndexWrapperLlama(
    model_url=None,
    model_path='mistral-7b-instruct-v0.2.Q4_K_M.gguf',
    temperature=0,
    max_new_tokens=512,
    context_window=32000, ### Make sure it fits the GPU you use. Depending on your context and the K value at RAG (see below) you will need 10~12k. If possible use all 32k.
    model_kwargs={"n_gpu_layers": 32}, ### Make sure it fits the GPU you use. The model has 32 layers so technically all of them should fit in a free tier T4 GPU.
)
llm.verbose=False

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 l

#### **Provide a short answer here on why you chose the values of temperature / max_new_tokens / context_window you chose.**
----------------------------------------------------------------------------
Answer:

I used the same temperature value as in the *Measuring Mathematical Problem Solving With the MATH Dataset* paper. A limit of 512 new tokens is a common default that leaves room for a full step-by-step explanation, given that the problems vary in complexity.

#### Let's now create a VectorStore Index using the joint problem-solution column of the training dataset.
#### Use the training set csv to populate your DB.
*(Hint: Use VectorStoreIndex class of LlamaIndex)*

In [None]:
loader = CSVReader()
documents = loader.load_data(file=Path('./math_dataset_problem_with_solution'))

splitter = SentenceSplitter(separator='problem_with_solution')
index = VectorStoreIndex.from_documents(
    llm=llm,
    documents=documents,
    splitter=splitter,
    embed_model=embed_model,
    verbose=False
)

#### Let's talk about Query Expansion, and why it is useful in complex RAG scenarios.
-----------------------------------------------------------------------------
Query expansion is a technique used to enhance the scope of a search query by generating additional related queries. This approach enriches the query process, increasing the likelihood of retrieving more comprehensive and relevant information. By generating variations of a math query, the system can pull from a wider array of similar past problems, leading to more robust and reliable answers.

#### **To Implement Query Expansion fill the cells below to:**
- Prompt your Chat Model with a query (test math problem) and return N different reformulations of the query.
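Chat models often return reformulations as a numbered or bulleted list, so the raw completion usually needs some cleaning before it can be used as a set of queries. The sketch below shows one possible way to do that in plain Python; the function name `parse_generated_queries` and the sample completion are hypothetical, and the string `raw` stands in for the text returned by the model.

```python
import re
from typing import List

# Matches list prefixes such as "1. ", "2) " or "- " at the start of a line.
_LIST_PREFIX = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")


def parse_generated_queries(original: str, raw_output: str, num_queries: int) -> List[str]:
    """Return the original query followed by at most num_queries - 1 cleaned reformulations."""
    cleaned: List[str] = []
    for line in raw_output.splitlines():
        line = _LIST_PREFIX.sub("", line).strip()
        # Skip blank lines, exact repeats of the original and duplicates.
        if line and line != original and line not in cleaned:
            cleaned.append(line)
    return [original] + cleaned[: num_queries - 1]


# Hypothetical raw completion from the chat model.
raw = (
    "1. Solve 2^(x-3) = 3^(x-2) for x\n"
    "\n"
    "2. Find x such that 2^(x-3) = 3^(x-2).\n"
    "3) Root of 2^(x-3) = 3^(x-2)"
)
queries = parse_generated_queries("What is the solution of 2^{x-3} = 3^{x-2}?", raw, num_queries=3)
for q in queries:
    print(q)
```

Requiring whitespace after the list marker keeps lines that begin with a number, such as `2^(x-3) = ...`, intact.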

In [None]:
query_gen_prompt_str = (
    "You are a helpful assistant that generates multiple search queries based on a "
    "single input query. Generate {num_queries} search queries, one on each line, "
    "related to the following input query:\n"
    "Query: {query}\n"
    "Queries:\n"
)
query_gen_prompt = PromptTemplate(query_gen_prompt_str)

def generate_queries(llm, query_str: str, num_queries: int = 4):
    """
        Return a list of the original query (query_str) followed by (num_queries - 1) queries generated by the llm.
    """
    prompt = query_gen_prompt_str.format(query=query_str, num_queries=num_queries - 1)
    response = str(llm.complete(prompt))
    # Strip a leading list number such as "1. " without touching periods later in the line
    queries = [re.sub(r"^\s*\d+[.)]\s+", "", line).strip() for line in response.split("\n")]
    # Drop empty lines and put the original query first
    queries = [query_str] + [query for query in queries if query]

    return queries

In [None]:
### Test: It should return the original query plus 9 new queries ###
query_str = "What is the solution of 2^{x-3} = 3^{x-2}?"
queries = generate_queries(llm, query_str, num_queries=10)
for f in queries:
    print(f)


llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     219.47 ms /   255 runs   (    0.86 ms per token,  1161.87 tokens per second)
llama_print_timings: prompt eval time =     374.68 ms /    65 tokens (    5.76 ms per token,   173.48 tokens per second)
llama_print_timings:        eval time =   12166.88 ms /   254 runs   (   47.90 ms per token,    20.88 tokens per second)
llama_print_timings:       total time =   13220.27 ms /   319 tokens


What is the solution of 2^{x-3} = 3^{x-2}?
Find the value of x that satisfies the equation 2^(x-3) = 3^(x-2)
Solving for x in the equation 2^(x-3) = 3^(x-2)
What is the root of the equation 2^(x-3) = 3^(x-2)?
Solve the exponential equation 2^(x-3) = 3^(x-2) for x
Equation 6: 2^(x-3) = 3^(x-2), find the value of x
X value in the equation 2^(x-3) = 3^(x-2)
Solve the exponent equation 2^(x-3) = 3^(x-2) for the variable x
Find the solution to the exponential equation 2^(x-3) = 3^(x-2)
What is the value of x in the equation 2^(x-3) = 3^(x-2)?


#### Implementing the Fusion Retriever
##### The BM25 retriever:
--------------------------------------------------------------------------
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters.
Given a query $Q$, containing keywords $q_1, ..., q_n$, the BM25 score of a document $D$ is:

$$
\text{score}(D,Q) = \sum _{i=1}^{n} \text{IDF}(q_{i}) \cdot \frac{f(q_{i},D) \cdot (k_{1}+1)}{f(q_{i},D) + k_{1} \cdot (1-b + b \cdot \frac{|D|}{\text{avgdl}})}
$$

where $f(q_{i}, D)$ is the number of times that the keyword $q_i$ occurs in the document $D$, $|D|$ is the length of the document $D$ in words, and avgdl is the average document length in the text collection from which documents are drawn. $k_1$ and $b$ are free parameters, usually chosen, in absence of an advanced optimization, as $k_1 \in [1.2, 2.0]$ and $b = 0.75$.
IDF($q_i$) is the inverse document frequency weight of the query term $q_i$.
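The scoring formula above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used by `BM25Retriever`: documents are assumed to be pre-tokenized lists of words, and since the text does not fix a specific IDF formula, the sketch uses the common smoothed variant $\ln\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)$, where $N$ is the number of documents and $n(q_i)$ the number of documents containing $q_i$. The function names are hypothetical.

```python
import heapq
import math
from collections import Counter
from typing import List, Tuple


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in docs against the tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query:
            f = term_freq[term]
            if f == 0:
                continue
            # Assumed IDF variant (not specified in the text above).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


def top_k(scores: List[float], k: int) -> List[Tuple[int, float]]:
    """Return the k best (document index, score) pairs using a heap."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Hypothetical toy corpus, tokenized by hand.
docs = [
    ["solve", "the", "equation", "for", "x"],
    ["expand", "the", "product"],
    ["find", "the", "value", "of", "x"],
]
scores = bm25_scores(["solve", "x"], docs)
print(top_k(scores, k=2))
```

`heapq.nlargest` avoids sorting the full score list, which helps when the corpus is large and only a few results are needed.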


#### **To Implement Fusion Retriever fill the cells below to:**
- Load the BM25 retriever from llama_index retrievers package
- Initialize a Vector retriever from your created VectorStoreIndex class
- Fill in the run_queries function below that will run a set of queries against your combined retrievers and return the top_k results.

#### **For an extra 1 point:**
- Extend the Retriever class of LlamaIndex and create a custom BM25 retriever with $k_1=1.5$ and $b = 0.7$. It does not have to be optimized, although a heap would be beneficial for performance.


In [None]:
from llama_index.retrievers.bm25 import BM25Retriever

# Use common TOP_K in both retrievers
TOP_K = 3
NUMBER_OF_RETRIEVERS = 2
## Vector retriever
vector_retriever = index.as_retriever(similarity_top_k=TOP_K)

## BM25 retriever
bm25_retriever = BM25Retriever.from_defaults(
   docstore=index.docstore,
   similarity_top_k=TOP_K)

async def run_queries(queries, retrievers):
    """Run every query against every retriever concurrently."""
    task_queries = []
    tasks = []
    for query in queries:
        for retriever in retrievers:
            # Remember which query each task belongs to, in task order
            task_queries.append(query)
            tasks.append(retriever.aretrieve(query))

    task_results = await tqdm.gather(*tasks)

    # Key each result list by (query, task index)
    results_dict = {}
    for i, (query, query_result) in enumerate(zip(task_queries, task_results)):
        results_dict[(query, i)] = query_result

    return results_dict

#### Important Note:

--------------------------------------------
Given $N$ queries and $M$ retrievers, each returning ```TOP_K``` results, the ```run_queries``` function can theoretically return up to $N \times M \times K$ results (with $K$ = ```TOP_K```). In practice, since many of the results are common between the retrievers, the total number of retrieved items can be lower.
Expect at least $N \times K$ items, which happens when both retrievers retrieve exactly the same things.

The ```results_dict``` has N entries, each corresponding to a query given. Each item of ```results_dict``` is a ```List``` of ```LlamaIndex Nodes``` with length ranging from $K$ (in case the retrievers found the same items) up to $M \times K$ (in case they found different items).
---------------------------------------------------
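The two extremes described in the note can be checked on a toy example. The sketch below merges the result lists of two retrievers for a single query and counts the distinct items; the item labels and the helper name `merge_per_query` are hypothetical.

```python
from typing import List


def merge_per_query(per_retriever: List[List[str]]) -> List[str]:
    """Union the result lists of all retrievers for one query, keeping first-seen order."""
    seen = dict.fromkeys(item for results in per_retriever for item in results)
    return list(seen)


# With TOP_K = 3 and M = 2 retrievers:
identical = merge_per_query([["a", "b", "c"], ["c", "b", "a"]])  # same items from both retrievers
disjoint = merge_per_query([["a", "b", "c"], ["d", "e", "f"]])   # no overlap at all
print(len(identical), len(disjoint))  # prints "3 6"
```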

In [None]:
### Test ###
query_str = "What is the solution of 2^{x-3} = 3^{x-2}?"
queries = generate_queries(llm, query_str, num_queries=3)
results_dict = await run_queries(queries, [vector_retriever, bm25_retriever])
for f in results_dict:
    print(f)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      26.25 ms /    56 runs   (    0.47 ms per token,  2132.93 tokens per second)
llama_print_timings: prompt eval time =     274.02 ms /    45 tokens (    6.09 ms per token,   164.22 tokens per second)
llama_print_timings:        eval time =    2008.44 ms /    55 runs   (   36.52 ms per token,    27.38 tokens per second)
llama_print_timings:       total time =    2355.13 ms /   100 tokens
100%|██████████| 6/6 [00:00<00:00, 65.89it/s]

('What is the solution of 2^{x-3} = 3^{x-2}?', 0)
('What is the solution of 2^{x-3} = 3^{x-2}?', 1)
('Find the value of x that satisfies the equation 2^(x-3) = 3^(x-2)', 2)
('Find the value of x that satisfies the equation 2^(x-3) = 3^(x-2)', 3)
('Solve for x in the equation 2^(x-3) = 3^(x-2)', 4)
('Solve for x in the equation 2^(x-3) = 3^(x-2)', 5)





#### **Retrieve and combine.**
Fill the cell below so that your function collects the results from the combined retrieval - as Nodes in LlamaIndex - and sorts them according to their score (node.score), then it returns the top_k results based on that score.
To get the actual text of a node use node.get_content()
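Raw scores from a vector retriever and from BM25 live on different scales, so the docstring of `fuse_results` asks for rank-based fusion: each item earns the reciprocal of its rank in every result list it appears in. The sketch below shows that accumulation on plain `(text, score)` tuples instead of LlamaIndex nodes; the function name and toy scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fuse_by_reciprocal_rank(result_lists: List[List[Tuple[str, float]]],
                            top_k: int = 2) -> List[Tuple[str, float]]:
    """Sum 1/rank over all result lists and return the top_k (text, fused score) pairs."""
    fused: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        # Rank within each list by the retriever's own score, best first.
        ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
        for rank, (text, _score) in enumerate(ranked, start=1):
            fused[text] += 1.0 / rank
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical results for one query from two retrievers.
vector_hits = [("doc_a", 0.91), ("doc_b", 0.80), ("doc_c", 0.42)]
bm25_hits = [("doc_b", 7.3), ("doc_d", 5.1), ("doc_a", 1.2)]
fused = fuse_by_reciprocal_rank([vector_hits, bm25_hits], top_k=2)
print(fused)
```

Here `doc_b` wins with 1/2 + 1/1 = 1.5, ahead of `doc_a` with 1/1 + 1/3, even though `doc_a` has the highest vector score.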

In [None]:
def fuse_results(results_dict, similarity_top_k: int = 2):
    """
    Create two dictionaries: fused_scores to store the cumulative scores for each unique text content, and text_to_node to map each text to its respective node and score.
    Loop through a dictionary results_dict that contains lists of node objects with associated scores.
        For each list of nodes:
            Sort the nodes in descending order based on their scores.
            Extract the text content of each node using a method like node.get_content().
            For each text content, check if it's already in fused_scores. If not, initialize its score to 0.0.
        Update the score of this text in fused_scores by adding the reciprocal of its rank (i.e., 1 divided by the position in the sorted list plus 1).
    Sorting and Re-ranking:
        Sort the fused_scores dictionary by value in descending order to prioritize texts with higher aggregated scores.
    Adjusting Node Scores:
        Based on the sorted scores, create a list ranked_nodes to store nodes with their updated scores.
        Populate this list by mapping each text back to its original node, updating the node’s score to the newly calculated aggregated score.
        Ensure only the top entries defined by similarity_top_k (an optional parameter with a default of 2) are returned.
    Function Return:
        The function should return the list of top nodes based on the sorted updated scores.
    """
    fused_scores = {}
    text_to_node = {}

    # Compute scores
    for nodes_with_scores in results_dict.values():
        for rank, node_with_score in enumerate(
            sorted(
                nodes_with_scores, key=lambda x: x.score or 0.0, reverse=True
            ),
            start=1
        ):
            text_content = node_with_score.get_content()
            if text_content not in fused_scores:
                fused_scores[text_content] = 0.0
            fused_scores[text_content] += 1 / rank
            text_to_node[text_content] = node_with_score

    ranked_results = dict(sorted(fused_scores.items(), key=lambda item: item[1], reverse=True))

    # Adjust node scores
    ranked_nodes: List[NodeWithScore] = []
    for text, score in ranked_results.items():
        # Fill this
        node_with_score = text_to_node[text]
        node_with_score.score = score
        ranked_nodes.append(node_with_score)

    return ranked_nodes[:similarity_top_k]

In [None]:
### Test it ###
query_str = "What is the solution of 2^{x-3} = 3^{x-2}?"
queries = generate_queries(llm, query_str, num_queries=4)
results_dict = await run_queries(queries, [vector_retriever, bm25_retriever])
final_results = fuse_results(results_dict)
for n in final_results:
    print(f"Score: {n.score}", "\n", n.text, "\n---\n")

Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      49.14 ms /    98 runs   (    0.50 ms per token,  1994.26 tokens per second)
llama_print_timings: prompt eval time =     272.26 ms /    45 tokens (    6.05 ms per token,   165.28 tokens per second)
llama_print_timings:        eval time =    3571.25 ms /    97 runs   (   36.82 ms per token,    27.16 tokens per second)
llama_print_timings:       total time =    3968.37 ms /   142 tokens
100%|██████████| 8/8 [00:00<00:00, 97.01it/s]

Score: 3.833333333333333 
 Thus, $x^2-y^2=3^2-7^2=\boxed{-40}$.
problem_with_solution$f (x) = x + 3$ and $g(x) = x^2 -6$, what is the value of $f (g(2))$?

Solution:
$f(g(2))=f(2^2-6)=f(-2)=-2+3=\boxed{1}$.
problem_with_solutionCompute: $55\times1212-15\times1212$ .

Solution:
We have $55 \times 1212 - 15 \times 1212 = 1212(55-15) = 1212(40) = 4848(10) = \boxed{48480}$.
problem_with_solutionCalculate: $(243)^{\frac35}$

Solution:
We start by finding the prime factorization of 243.  We find $243 = 3^5$, so we have $(243)^{\frac35} = (3^5)^{\frac35} = 3^{5\cdot \frac{3}{5}} = 3^3 = \boxed{27}$.
problem_with_solutionSimplify $(1)(2a)(3a^2)(4a^3)(5a^4)$.

Solution:
Simplifying, we have: \begin{align*}
(1)(2a)(3a^2)(4a^3)(5a^4) &= (1)(2)(3)(4)(5)(a)(a^2)(a^3)(a^4) \\
&= 120a^{1+2+3+4} = \boxed{120a^{10}}.
\end{align*}
problem_with_solutionExpand the product ${(x+2)(x+5)}$.

Solution:
When using the distributive property for the first time, we add the product of $x+2$ and $x$ to the product 




#### **From Retrieval to Math problem solving:**
Take a look at the FusionRetriever Class below:

- It overrides the BaseRetriever class, by altering the functionality of the _retrieve function to implement query expansion given a provided argument.

- Then it calls the fuse_results function to score the examples keeping the top_k ones, packing the previous steps in a single Retriever Class

In [None]:
class FusionRetriever(BaseRetriever):
    """Ensemble retriever with fusion."""
    def __init__(
        self,
        llmcpp,
        retrievers: List[BaseRetriever],
        similarity_top_k: int = 1,
        n_query_expansion: int = 1,
        enable_query_expansion: bool = False,

    ) -> None:
        """Init params."""
        self._retrievers = retrievers
        self._similarity_top_k = similarity_top_k
        self._llmcpp = llmcpp
        self.n_query_expansion = n_query_expansion
        self.enable_query_expansion = enable_query_expansion
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve the results. REMEMBER: to run an async function use asyncio.run(###THE ASYNC FUNCTION###)"""
        if self.enable_query_expansion:
            queries = generate_queries(self._llmcpp, query_bundle.query_str, num_queries=self.n_query_expansion)
        else:
            queries = [query_bundle.query_str]
        results_dict = asyncio.run(run_queries(queries, self._retrievers))
        final_results = fuse_results(results_dict, similarity_top_k=self._similarity_top_k)

        return final_results

##### Test the following function and answer the following questions:
-  What do you observe with / without the query expansion flag given the query below?
-  The retrieved text is simply added before our problem string. Given that our model is an Instruction Tuned model, is this correct?

------------------------------------------------------------------------------
Answer:


In [None]:
query_str = "\nProblem:\nWhat is the solution of 2^{x-3} = 3^{x-2}?\nSolution:\n"
fusion_retriever = FusionRetriever(
   llm, [vector_retriever, bm25_retriever], similarity_top_k=3
)
query_engine = RetrieverQueryEngine.from_args(fusion_retriever, llm=llm)
response = query_engine.query(query_str)
print(response.response)

100%|██████████| 2/2 [00:00<00:00, 59.28it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     262.49 ms /   487 runs   (    0.54 ms per token,  1855.34 tokens per second)
llama_print_timings: prompt eval time =    3690.71 ms /  2945 tokens (    1.25 ms per token,   797.95 tokens per second)
llama_print_timings:        eval time =   21564.50 ms /   486 runs   (   44.37 ms per token,    22.54 tokens per second)
llama_print_timings:       total time =   26229.33 ms /  3431 tokens



To solve for x in the equation 2^(x-3) = 3^(x-2), we can take the natural logarithm (base e) of both sides. Since the base does not matter when taking the natural logarithm, we can use the common logarithm instead:

log_e(2^(x-3)) = log_e(3^(x-2))

Using the power rule for logarithms, we have:

(x-3) * log_e(2) = (x-2) * log_e(3)

Now, we can set up an equation to solve for x:

(x-3) = (x-2) * log_e(3)/log_e(2)

Subtracting x from both sides and factoring out x, we get:

x - 3 = x * (log_e(3)/log_e(2) - 1)

Dividing both sides by (log_e(3)/log_e(2) - 1), we have:

x = 3 / [1 - log_e(3)/log_e(2)]

Using a calculator, we find that log_e(3)/log_e(2) ≈ 1.584962500721156. Substituting this value into the equation for x, we get:

x = 3 / [1 - 1.584962500721156]
x = 3 / (-0.584962500721156)
x = -5.14252461482798 / 0.584962500721156
x = \boxed{-8.7939734525132}.

Therefore, the solution of 2^(x-3) = 3^(x-2) is approximately x = -8.7939734525132.


##### Now, disable query expansion and set top_k as high as your GPU and context size allow (3 is a good value).

In order to properly format the retrieved problem-solution pairs into useful in-context examples we need to prompt the Chat Model (through the RetrieverQueryEngine call).

This can be done with the text_qa_template argument, which receives a PromptTemplate object. This object wraps a string that can be formatted in 2 positions: ```{context_str}``` and ```{query_str}```.

For example:
```
text_qa_template_str = (
    "This is an example.\n"
    "The retrieved items will be put here {context_str}.\n"
    "While the current problem will be put here {query_str}."
)
text_qa_template = PromptTemplate(text_qa_template_str)

```

Your goal is to find an appropriate way to prompt your Chat Model so that it can properly utilize its context examples.
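To see what the model actually receives, it can help to fill a template by hand before passing it to the query engine. The sketch below uses plain `str.format` with the same two placeholders; the template wording, the helper `build_prompt` and the example strings are hypothetical, and the retrieved examples are assumed to be joined with blank lines.

```python
from typing import List

# Hypothetical template with the two placeholders described above.
TEMPLATE = (
    "Below are solved examples of similar problems.\n"
    "{context_str}\n\n"
    "Solve the new problem step by step, following the style of the examples.\n"
    "{query_str}"
)


def build_prompt(template: str, examples: List[str], problem: str) -> str:
    """Fill the template with the retrieved examples and the current problem."""
    context_str = "\n\n".join(examples)
    return template.format(context_str=context_str, query_str=problem)


examples = [
    "Problem:\nWhat is 2^3?\n\nSolution:\n2^3 = 8.",
    "Problem:\nWhat is 3^2?\n\nSolution:\n3^2 = 9.",
]
prompt = build_prompt(TEMPLATE, examples, "Problem:\nWhat is 5^2?\nSolution:\n")
print(prompt)
```

Printing the filled prompt makes it easy to check that the examples come before the new problem and that the prompt ends where the model should start writing.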


In [None]:
### Test it ###
query_str = "\nProblem:\nWhat is the solution of 2^{x-3} = 3^{x-2}?\nSolution:\n"
fusion_retriever = FusionRetriever(
   llm, [vector_retriever, bm25_retriever], similarity_top_k=3
)
text_qa_template_str = (
    "You are a mathematical solving companion. Your role is to analyze the following examples and use them to solve a new problem.\n"
    "Previous examples:\n{context_str}\n"
    "Please solve this new problem based on the examples provided:\n"
    "{query_str}"
)
text_qa_template = PromptTemplate(text_qa_template_str)

query_engine = RetrieverQueryEngine.from_args(fusion_retriever, llm=llm, text_qa_template=text_qa_template)
response = query_engine.query(query_str)
print(response.response)

100%|██████████| 2/2 [00:00<00:00, 59.42it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     272.92 ms /   512 runs   (    0.53 ms per token,  1875.98 tokens per second)
llama_print_timings: prompt eval time =    3698.07 ms /  2951 tokens (    1.25 ms per token,   797.98 tokens per second)
llama_print_timings:        eval time =   22872.78 ms /   511 runs   (   44.76 ms per token,    22.34 tokens per second)
llama_print_timings:       total time =   27569.96 ms /  3462 tokens


We can set up the equation 2^{x-3} = 3^{x-2}. Since 2 and 3 are not equal, we cannot directly compare their exponents. However, we can make them equal by setting their bases to be the same. We take the logarithm base 2 of both sides:
\begin{align*}
\log_2(2^{x-3}) &= \log_2(3^{x-2}) \\
(x-3)\log_2(2) &= (x-2)\log_2(3)
\end{align*}Now, we can solve for x by setting the logarithms equal:
\begin{align*}
(x-3)\log_2(2) &= (x-2)\log_2(3) \\
x\log_2(2)-3\log_2(2) &= x\log_2(3)-2\log_2(3) \\
x(\log_2(2)-\log_2(3)) &= 3\log_2(3)-2\log_2(3) \\
x(\log_2(2/3)) &= \log_2(3)^2-2\log_2(3)\log_2(3)+2\log_2(3) \\
x(\log_2(2/3)) &= \log_2(3)(3+\log_2(3))-\log_2(3) \\
x(\log_2(2/3)) &= \boxed{\log_2(3)^2+2\log_2(3)-\log_2(2/3)}
\end{align*}To find the exact value of x, we need to compute $\log_2(2/3)$. We can use a calculator or logarithm tables:
\begin{align*}
\log_2(2/3) &= \log_2(2)-\log_2(3) \\
&= 1-\log_2(3) \\
&\approx 0.4875
\end{align*}Now, we can find the value of x:
\begin{align*}
x(\log_2(2/3

#### Iterate over the first **35** questions of the Test dataset and calculate the respective performance.

#### **Perform the following check: Zero-Shot + Inception Prompting.**
Prepend the problem task with an inception prompt ("You are a ..."), which has been found to be beneficial for performance in reasoning and math tasks.

You will encounter the following 2 problems:
- Your model will probably answer with a long analytic solution containing multiple steps and numbers. Fill in ```extract_last_floating_number``` so that it returns the last real (positive or negative) number in a string, and apply it to your solutions.
- The MATH ground-truth answers are located at the end of the solution string, enclosed in a \boxed{} LaTeX command. Fill in ```extract_answer_from_boxed``` so that it returns the ground truth answer.

In [None]:
import re

def extract_last_floating_number(text):
    """
    Extracts the last real number from a string.
    """
    numbers = re.findall(r"-?\d+\.?\d*", text)
    if numbers:
        return float(numbers[-1])
    return None

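def extract_answer_from_boxed(expression):
    r"""
    Extracts the content within the \boxed{} LaTeX command.
    """
    # Raw docstring above: in a normal string "\b" would be a backspace character.
    # The greedy (.*) runs to the last "}" on the same line as \boxed{.
    mat = re.search(r"\\boxed\{(.*)\}", expression)
    if mat:
        return mat.group(1)
    return ""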
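
The greedy pattern above can capture too much when another closing brace follows the boxed answer on the same line. As an optional alternative, here is a minimal sketch that counts braces instead; the name `extract_last_boxed` is hypothetical and not part of the exercise template.

```python
def extract_last_boxed(text):
    """Return the content of the last \\boxed{...} in text, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return ""
    depth = 1          # we are inside the opening brace of \boxed{
    content = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close of \boxed{ found.
                return "".join(content)
        content.append(ch)
    return ""  # unbalanced braces: no matching close found

print(extract_last_boxed("so $x = \\boxed{\\frac{1}{2}}$ and {y}"))
```

With the regex version, the trailing `{y}` in this string would be pulled into the answer; the brace counter stops at the matching close.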

In [None]:
### Iterate over 35 items of test set ###

fusion_retriever = FusionRetriever(
   llm, [vector_retriever, bm25_retriever], similarity_top_k=3
)
text_qa_template_str = (
    "You are a mathematical solving companion. Your role is to analyze the following examples and use them to solve a new problem.\n"
    "Previous examples:\n{context_str}\n"
    "Please solve this new problem based on the examples provided:\n"
    "{query_str}"
)
text_qa_template = PromptTemplate(text_qa_template_str)

query_engine = RetrieverQueryEngine.from_args(fusion_retriever, llm=llm, text_qa_template=text_qa_template)

def calculate_accuracy(test_set, num_items=35):
    correct_answers = 0

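    # head() limits the loop to the first num_items rows regardless of the index values.
    for _, row in test_set.head(num_items).iterrows():
        # Same layout as the test query above: newline before and after "Solution:".
        problem_str = 'Problem:\n' + row['problem'] + '\nSolution:\n'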
        response = query_engine.query(problem_str)
        llm_sol = response.response

        last_number = extract_last_floating_number(llm_sol)
        boxed_answer = extract_answer_from_boxed(row['solution'])

        print('LLM RESULT:',last_number)
        print('GT:', boxed_answer)

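        try:
            if float(boxed_answer) == float(last_number):
                correct_answers += 1
        except (ValueError, TypeError):
            # ValueError: the ground truth is not a plain number (e.g. a LaTeX fraction).
            # TypeError: no number was extracted from the response (last_number is None).
            # Either way the item counts as incorrect.
            pass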

    accuracy = correct_answers / num_items
    print(f"Accuracy: {accuracy:.2%}")

    return accuracy

# Run the accuracy calculation
test_set = pd.read_csv("math_dataset_test.csv")
test_accuracy = calculate_accuracy(test_set)

100%|██████████| 2/2 [00:00<00:00, 94.46it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     114.93 ms /   211 runs   (    0.54 ms per token,  1835.85 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    9818.07 ms /   211 runs   (   46.53 ms per token,    21.49 tokens per second)
llama_print_timings:       total time =   10196.44 ms /   211 tokens


LLM RESULT: 33.4
GT: 10


100%|██████████| 2/2 [00:00<00:00, 100.64it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      51.76 ms /    91 runs   (    0.57 ms per token,  1758.28 tokens per second)
llama_print_timings: prompt eval time =    4688.26 ms /  3489 tokens (    1.34 ms per token,   744.20 tokens per second)
llama_print_timings:        eval time =    4248.81 ms /    90 runs   (   47.21 ms per token,    21.18 tokens per second)
llama_print_timings:       total time =    9089.91 ms /  3579 tokens


LLM RESULT: 3.0
GT: 4


100%|██████████| 2/2 [00:00<00:00, 80.74it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      58.35 ms /   115 runs   (    0.51 ms per token,  1970.80 tokens per second)
llama_print_timings: prompt eval time =    4697.15 ms /  3457 tokens (    1.36 ms per token,   735.98 tokens per second)
llama_print_timings:        eval time =    5228.10 ms /   114 runs   (   45.86 ms per token,    21.81 tokens per second)
llama_print_timings:       total time =   10167.81 ms /  3571 tokens


LLM RESULT: 2.0
GT: 8


100%|██████████| 2/2 [00:00<00:00, 50.44it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     119.37 ms /   198 runs   (    0.60 ms per token,  1658.65 tokens per second)
llama_print_timings: prompt eval time =    4675.70 ms /  3477 tokens (    1.34 ms per token,   743.63 tokens per second)
llama_print_timings:        eval time =    9235.83 ms /   197 runs   (   46.88 ms per token,    21.33 tokens per second)
llama_print_timings:       total time =   14341.42 ms /  3674 tokens


LLM RESULT: 187.5
GT: 187.5


100%|██████████| 2/2 [00:00<00:00, 54.09it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      27.41 ms /    58 runs   (    0.47 ms per token,  2115.63 tokens per second)
llama_print_timings: prompt eval time =    4401.96 ms /  3364 tokens (    1.31 ms per token,   764.21 tokens per second)
llama_print_timings:        eval time =    2521.52 ms /    57 runs   (   44.24 ms per token,    22.61 tokens per second)
llama_print_timings:       total time =    7008.97 ms /  3421 tokens


LLM RESULT: 3125.0
GT: 3125


100%|██████████| 2/2 [00:00<00:00, 100.31it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     118.31 ms /   209 runs   (    0.57 ms per token,  1766.49 tokens per second)
llama_print_timings: prompt eval time =    3516.99 ms /  2716 tokens (    1.29 ms per token,   772.25 tokens per second)
llama_print_timings:        eval time =    9267.47 ms /   208 runs   (   44.56 ms per token,    22.44 tokens per second)
llama_print_timings:       total time =   13175.87 ms /  2924 tokens


LLM RESULT: 32.0
GT: 4


100%|██████████| 2/2 [00:00<00:00, 100.77it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     277.97 ms /   512 runs   (    0.54 ms per token,  1841.90 tokens per second)
llama_print_timings: prompt eval time =    4523.91 ms /  3419 tokens (    1.32 ms per token,   755.76 tokens per second)
llama_print_timings:        eval time =   23267.09 ms /   511 runs   (   45.53 ms per token,    21.96 tokens per second)
llama_print_timings:       total time =   28787.50 ms /  3930 tokens


LLM RESULT: 2.0
GT: 6


100%|██████████| 2/2 [00:00<00:00, 96.12it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      45.91 ms /    91 runs   (    0.50 ms per token,  1982.35 tokens per second)
llama_print_timings: prompt eval time =    4625.30 ms /  3479 tokens (    1.33 ms per token,   752.17 tokens per second)
llama_print_timings:        eval time =    4013.37 ms /    90 runs   (   44.59 ms per token,    22.43 tokens per second)
llama_print_timings:       total time =    8772.17 ms /  3569 tokens


LLM RESULT: 86.0
GT: 8


100%|██████████| 2/2 [00:00<00:00, 83.38it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     204.59 ms /   368 runs   (    0.56 ms per token,  1798.71 tokens per second)
llama_print_timings: prompt eval time =    4782.75 ms /  3590 tokens (    1.33 ms per token,   750.61 tokens per second)
llama_print_timings:        eval time =   17129.93 ms /   367 runs   (   46.68 ms per token,    21.42 tokens per second)
llama_print_timings:       total time =   22634.55 ms /  3957 tokens


LLM RESULT: 1.91
GT: 8


100%|██████████| 2/2 [00:00<00:00, 69.99it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      20.90 ms /    41 runs   (    0.51 ms per token,  1961.82 tokens per second)
llama_print_timings: prompt eval time =    4489.45 ms /  3425 tokens (    1.31 ms per token,   762.90 tokens per second)
llama_print_timings:        eval time =    1779.89 ms /    40 runs   (   44.50 ms per token,    22.47 tokens per second)
llama_print_timings:       total time =    6330.60 ms /  3465 tokens


LLM RESULT: 48.0
GT: 26


100%|██████████| 2/2 [00:00<00:00, 82.54it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      22.04 ms /    45 runs   (    0.49 ms per token,  2041.46 tokens per second)
llama_print_timings: prompt eval time =    4337.19 ms /  3326 tokens (    1.30 ms per token,   766.86 tokens per second)
llama_print_timings:        eval time =    1948.63 ms /    44 runs   (   44.29 ms per token,    22.58 tokens per second)
llama_print_timings:       total time =    6352.57 ms /  3370 tokens


LLM RESULT: 1.0
GT: 1


100%|██████████| 2/2 [00:00<00:00, 80.37it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      62.28 ms /   117 runs   (    0.53 ms per token,  1878.58 tokens per second)
llama_print_timings: prompt eval time =    4547.21 ms /  3478 tokens (    1.31 ms per token,   764.86 tokens per second)
llama_print_timings:        eval time =    5393.61 ms /   116 runs   (   46.50 ms per token,    21.51 tokens per second)
llama_print_timings:       total time =   10141.68 ms /  3594 tokens


LLM RESULT: 6.0
GT: 6


100%|██████████| 2/2 [00:00<00:00, 51.15it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     274.26 ms /   512 runs   (    0.54 ms per token,  1866.81 tokens per second)
llama_print_timings: prompt eval time =    3658.06 ms /  2836 tokens (    1.29 ms per token,   775.27 tokens per second)
llama_print_timings:        eval time =   22643.93 ms /   511 runs   (   44.31 ms per token,    22.57 tokens per second)
llama_print_timings:       total time =   27300.50 ms /  3347 tokens


LLM RESULT: 120.0
GT: 10


100%|██████████| 2/2 [00:00<00:00, 85.84it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      35.01 ms /    74 runs   (    0.47 ms per token,  2113.50 tokens per second)
llama_print_timings: prompt eval time =    4585.81 ms /  3547 tokens (    1.29 ms per token,   773.47 tokens per second)
llama_print_timings:        eval time =    3264.46 ms /    73 runs   (   44.72 ms per token,    22.36 tokens per second)
llama_print_timings:       total time =    7954.39 ms /  3620 tokens


LLM RESULT: 15.0
GT: 15


100%|██████████| 2/2 [00:00<00:00, 94.65it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     274.24 ms /   512 runs   (    0.54 ms per token,  1867.01 tokens per second)
llama_print_timings: prompt eval time =    4680.25 ms /  3512 tokens (    1.33 ms per token,   750.39 tokens per second)
llama_print_timings:        eval time =   23323.74 ms /   511 runs   (   45.64 ms per token,    21.91 tokens per second)
llama_print_timings:       total time =   29007.10 ms /  4023 tokens


LLM RESULT: 0.0
GT: -8


100%|██████████| 2/2 [00:00<00:00, 52.36it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      55.92 ms /   116 runs   (    0.48 ms per token,  2074.28 tokens per second)
llama_print_timings: prompt eval time =    4545.18 ms /  3442 tokens (    1.32 ms per token,   757.29 tokens per second)
llama_print_timings:        eval time =    5117.74 ms /   115 runs   (   44.50 ms per token,    22.47 tokens per second)
llama_print_timings:       total time =    9834.20 ms /  3557 tokens


LLM RESULT: 25.0
GT: 19


100%|██████████| 2/2 [00:00<00:00, 91.99it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      60.24 ms /   123 runs   (    0.49 ms per token,  2041.83 tokens per second)
llama_print_timings: prompt eval time =    4615.85 ms /  3470 tokens (    1.33 ms per token,   751.76 tokens per second)
llama_print_timings:        eval time =    5445.60 ms /   122 runs   (   44.64 ms per token,    22.40 tokens per second)
llama_print_timings:       total time =   10247.09 ms /  3592 tokens


LLM RESULT: -21.0
GT: 0


100%|██████████| 2/2 [00:00<00:00, 81.12it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      30.97 ms /    47 runs   (    0.66 ms per token,  1517.60 tokens per second)
llama_print_timings: prompt eval time =    4568.61 ms /  3461 tokens (    1.32 ms per token,   757.56 tokens per second)
llama_print_timings:        eval time =    2199.21 ms /    46 runs   (   47.81 ms per token,    20.92 tokens per second)
llama_print_timings:       total time =    6861.53 ms /  3507 tokens


LLM RESULT: 4.0
GT: 4


100%|██████████| 2/2 [00:00<00:00, 53.25it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      83.22 ms /   164 runs   (    0.51 ms per token,  1970.73 tokens per second)
llama_print_timings: prompt eval time =    4542.89 ms /  3473 tokens (    1.31 ms per token,   764.49 tokens per second)
llama_print_timings:        eval time =    7387.85 ms /   163 runs   (   45.32 ms per token,    22.06 tokens per second)
llama_print_timings:       total time =   12193.80 ms /  3636 tokens


LLM RESULT: 2.0
GT: 8


100%|██████████| 2/2 [00:00<00:00, 58.85it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      54.92 ms /   110 runs   (    0.50 ms per token,  2002.99 tokens per second)
llama_print_timings: prompt eval time =    4607.65 ms /  3496 tokens (    1.32 ms per token,   758.74 tokens per second)
llama_print_timings:        eval time =    4865.81 ms /   109 runs   (   44.64 ms per token,    22.40 tokens per second)
llama_print_timings:       total time =    9632.63 ms /  3605 tokens


LLM RESULT: 2.0
GT: 1


100%|██████████| 2/2 [00:00<00:00, 79.33it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      55.16 ms /   112 runs   (    0.49 ms per token,  2030.49 tokens per second)
llama_print_timings: prompt eval time =    3437.55 ms /  2420 tokens (    1.42 ms per token,   703.99 tokens per second)
llama_print_timings:        eval time =    5039.87 ms /   111 runs   (   45.40 ms per token,    22.02 tokens per second)
llama_print_timings:       total time =    8653.26 ms /  2531 tokens


LLM RESULT: 73.0
GT: 73


100%|██████████| 2/2 [00:00<00:00, 102.26it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      16.08 ms /    24 runs   (    0.67 ms per token,  1492.72 tokens per second)
llama_print_timings: prompt eval time =    4949.45 ms /  3609 tokens (    1.37 ms per token,   729.17 tokens per second)
llama_print_timings:        eval time =    1136.38 ms /    23 runs   (   49.41 ms per token,    20.24 tokens per second)
llama_print_timings:       total time =    6137.30 ms /  3632 tokens


LLM RESULT: -5.0
GT: -5


100%|██████████| 2/2 [00:00<00:00, 51.77it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     235.01 ms /   471 runs   (    0.50 ms per token,  2004.20 tokens per second)
llama_print_timings: prompt eval time =    4660.57 ms /  3573 tokens (    1.30 ms per token,   766.64 tokens per second)
llama_print_timings:        eval time =   21421.67 ms /   470 runs   (   45.58 ms per token,    21.94 tokens per second)
llama_print_timings:       total time =   26946.00 ms /  4043 tokens


LLM RESULT: 9.0
GT: 16


100%|██████████| 2/2 [00:00<00:00, 54.52it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      39.75 ms /    73 runs   (    0.54 ms per token,  1836.43 tokens per second)
llama_print_timings: prompt eval time =    4552.38 ms /  3449 tokens (    1.32 ms per token,   757.63 tokens per second)
llama_print_timings:        eval time =    3290.83 ms /    72 runs   (   45.71 ms per token,    21.88 tokens per second)
llama_print_timings:       total time =    7963.36 ms /  3521 tokens


LLM RESULT: 401.0
GT: 400


100%|██████████| 2/2 [00:00<00:00, 90.60it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      62.98 ms /   117 runs   (    0.54 ms per token,  1857.73 tokens per second)
llama_print_timings: prompt eval time =    3149.97 ms /  2275 tokens (    1.38 ms per token,   722.23 tokens per second)
llama_print_timings:        eval time =    5327.21 ms /   116 runs   (   45.92 ms per token,    21.77 tokens per second)
llama_print_timings:       total time =    8692.29 ms /  2391 tokens


LLM RESULT: 13.0
GT: 12


100%|██████████| 2/2 [00:00<00:00, 83.76it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      24.47 ms /    39 runs   (    0.63 ms per token,  1593.98 tokens per second)
llama_print_timings: prompt eval time =    4503.16 ms /  3456 tokens (    1.30 ms per token,   767.46 tokens per second)
llama_print_timings:        eval time =    1687.75 ms /    38 runs   (   44.41 ms per token,    22.52 tokens per second)
llama_print_timings:       total time =    6254.03 ms /  3494 tokens


LLM RESULT: 169.0
GT: 169


100%|██████████| 2/2 [00:00<00:00, 77.51it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      69.26 ms /   134 runs   (    0.52 ms per token,  1934.77 tokens per second)
llama_print_timings: prompt eval time =    4570.06 ms /  3415 tokens (    1.34 ms per token,   747.25 tokens per second)
llama_print_timings:        eval time =    5927.48 ms /   133 runs   (   44.57 ms per token,    22.44 tokens per second)
llama_print_timings:       total time =   10700.95 ms /  3548 tokens


LLM RESULT: 5.0
GT: 36


100%|██████████| 2/2 [00:00<00:00, 89.35it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      70.55 ms /   112 runs   (    0.63 ms per token,  1587.62 tokens per second)
llama_print_timings: prompt eval time =    4492.10 ms /  3438 tokens (    1.31 ms per token,   765.34 tokens per second)
llama_print_timings:        eval time =    5394.69 ms /   111 runs   (   48.60 ms per token,    20.58 tokens per second)
llama_print_timings:       total time =   10106.63 ms /  3549 tokens


LLM RESULT: 100.0
GT: 100


100%|██████████| 2/2 [00:00<00:00, 90.52it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     192.55 ms /   371 runs   (    0.52 ms per token,  1926.79 tokens per second)
llama_print_timings: prompt eval time =    4609.47 ms /  3583 tokens (    1.29 ms per token,   777.31 tokens per second)
llama_print_timings:        eval time =   16859.78 ms /   370 runs   (   45.57 ms per token,    21.95 tokens per second)
llama_print_timings:       total time =   22110.71 ms /  3953 tokens


LLM RESULT: 0.0
GT: 123


100%|██████████| 2/2 [00:00<00:00, 49.39it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      49.80 ms /   101 runs   (    0.49 ms per token,  2028.11 tokens per second)
llama_print_timings: prompt eval time =    4899.78 ms /  3594 tokens (    1.36 ms per token,   733.50 tokens per second)
llama_print_timings:        eval time =    4484.51 ms /   100 runs   (   44.85 ms per token,    22.30 tokens per second)
llama_print_timings:       total time =    9534.75 ms /  3694 tokens


LLM RESULT: 480.0
GT: 220


100%|██████████| 2/2 [00:00<00:00, 97.00it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     207.85 ms /   389 runs   (    0.53 ms per token,  1871.55 tokens per second)
llama_print_timings: prompt eval time =    4933.57 ms /  3628 tokens (    1.36 ms per token,   735.37 tokens per second)
llama_print_timings:        eval time =   17863.27 ms /   388 runs   (   46.04 ms per token,    21.72 tokens per second)
llama_print_timings:       total time =   23529.12 ms /  4016 tokens


LLM RESULT: 732.0
GT: 364


100%|██████████| 2/2 [00:00<00:00, 97.63it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      57.71 ms /    99 runs   (    0.58 ms per token,  1715.41 tokens per second)
llama_print_timings: prompt eval time =    4605.93 ms /  3550 tokens (    1.30 ms per token,   770.74 tokens per second)
llama_print_timings:        eval time =    4585.51 ms /    98 runs   (   46.79 ms per token,    21.37 tokens per second)
llama_print_timings:       total time =    9367.00 ms /  3648 tokens


LLM RESULT: 70.0
GT: 80


100%|██████████| 2/2 [00:00<00:00, 76.13it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      56.04 ms /   109 runs   (    0.51 ms per token,  1944.97 tokens per second)
llama_print_timings: prompt eval time =    4885.92 ms /  3635 tokens (    1.34 ms per token,   743.97 tokens per second)
llama_print_timings:        eval time =    4916.54 ms /   108 runs   (   45.52 ms per token,    21.97 tokens per second)
llama_print_timings:       total time =    9974.79 ms /  3743 tokens


LLM RESULT: 8.0
GT: 7


100%|██████████| 2/2 [00:00<00:00, 62.88it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     275.21 ms /   512 runs   (    0.54 ms per token,  1860.38 tokens per second)
llama_print_timings: prompt eval time =    4459.31 ms /  3384 tokens (    1.32 ms per token,   758.86 tokens per second)
llama_print_timings:        eval time =   23278.04 ms /   511 runs   (   45.55 ms per token,    21.95 tokens per second)
llama_print_timings:       total time =   28759.74 ms /  3895 tokens


LLM RESULT: 1.0
GT: 3


100%|██████████| 2/2 [00:00<00:00, 98.47it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =      24.03 ms /    46 runs   (    0.52 ms per token,  1914.51 tokens per second)
llama_print_timings: prompt eval time =    4550.73 ms /  3525 tokens (    1.29 ms per token,   774.60 tokens per second)
llama_print_timings:        eval time =    2017.21 ms /    45 runs   (   44.83 ms per token,    22.31 tokens per second)
llama_print_timings:       total time =    6640.17 ms /  3570 tokens


LLM RESULT: 16.0
GT: 800
Accuracy: 28.57%


#### Iterate over the first **35** questions of the Test dataset and calculate the respective performance.

#### **Perform the following check: FusionRAG + K-way.**
Now test the FusionRAG performance by performing queries to your RetrieverQueryEngine made in the previous step.

You need to fill in the following ```infer_at_k``` function so that it does the following:
- Accepts the incoming math problem and a parameter k as arguments.
- Retrieves the relevant in-context examples from the FusionRetriever.
- Formats the retrieved in-context examples and passes them to the RetrieverQueryEngine.
- Generates k responses from the RetrieverQueryEngine.
---------------------------------------------------------------------
- **Very Important:** Make sure you apply some post-processing to your responses.
You will most probably need to truncate them if you see that the Chat Model, after answering your question, makes up another problem and tries to solve it as well.
For example, you might look for the keyword ```"Problem:"``` in the answer and keep everything before it.
The same applies to patterns like multiple empty new lines ```'\n\n\n'``` or ```'-----'```.
- Your function should return two lists: the strings that represent the truncated responses, and the answers as numbers (or None if no number could be extracted). Each list should have length k.

(Hint: Use the ```extract_last_floating_number``` to perform this)
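
The steps above can be sketched without a model as follows. This is only an illustration under stated assumptions: `fake_generate` is a hypothetical stand-in for `query_engine.query(...).response`, the canned responses are made up, and `truncate`, `last_number` and `k_way` are illustrative names rather than the functions you are asked to fill in.

```python
import re
from collections import Counter

def truncate(text):
    """Cut the response at the first sign that the model started a new problem."""
    # Keep only the text before "Problem:", three newlines in a row, or a dashed rule.
    return re.split(r"Problem:|\n\n\n|-{5,}", text, maxsplit=1)[0]

def last_number(text):
    """Return the last real number in text as a float, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(numbers[-1]) if numbers else None

def k_way(generate, problem, k=3):
    """Call generate(problem) k times; return (truncated texts, numbers)."""
    texts = [truncate(generate(problem)) for _ in range(k)]
    numbers = [last_number(t) for t in texts]
    return texts, numbers

# Hypothetical canned responses standing in for three model generations.
canned = iter([
    "The answer is 4.\n\n\nProblem:\nWhat is 9 + 9?\nSolution: 18",
    "So x = 4.0\n-----\nAnother example: 7",
    "I am not sure.",
])

def fake_generate(problem):
    return next(canned)

texts, numbers = k_way(fake_generate, "Problem:\nWhat is 2 + 2?\nSolution:\n")
votes = Counter(n for n in numbers if n is not None)
print(numbers)               # [4.0, 4.0, None]
print(votes.most_common(1))  # [(4.0, 2)]
```

Without truncation, the first two responses would yield 18 and 7 as the "last number", which shows why the post-processing step matters for the vote.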


In [None]:
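from collections import Counter

def most_common_item(items):
    """Return the most frequent non-None item, or None if there is none."""
    # Ignore responses from which no number could be extracted.
    valid = [item for item in items if item is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]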

In [None]:
def postprocess_response(response):
    # Match the "Problem:" marker used in the prompt, not any mention of the word.
    if 'Problem:' in response:
        response = response.split('Problem:', 1)[0]

    newline_split = re.split(r'\n\n\n', response)
    if len(newline_split) > 1:
        response = newline_split[0]

    dash_split = re.split(r'----+', response)
    if len(dash_split) > 1:
        response = dash_split[0]

    return response


def infer_at_k(problem, k=3):
    fusion_retriever = FusionRetriever(
        llm, [vector_retriever, bm25_retriever], similarity_top_k=3
        )
    text_qa_template_str = (
        "You are a mathematical solving companion. Your role is to analyze the following examples and use them to solve a new problem.\n"
        "Previous examples:\n{context_str}\n"
        "Please solve this new problem based on the examples provided:\n"
        "{query_str}"
    )
    text_qa_template = PromptTemplate(text_qa_template_str)
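    # Keep the engine that was just built instead of relying on a global one.
    query_engine = RetrieverQueryEngine.from_args(fusion_retriever, llm=llm, text_qa_template=text_qa_template)
    responses = []
    full_text_responses = []

    for _ in range(k):
        # Query with the problem passed in, not the global query_str from the earlier test cell.
        response = query_engine.query(problem)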
        sol = response.response
        processed_sol = postprocess_response(sol)
        responses.append(extract_last_floating_number(processed_sol))
        full_text_responses.append(processed_sol)
    return responses, full_text_responses

#### Iterate over the 35 test examples below ###
-----------------------------------------------------------------------------
Hint: It is quite useful - although not necessary - if you also save the generated responses and the generated numbers for each of the 35 questions into two pickle files named ```rag_test_responses.pt``` and ```rag_test_numbers.pt```
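
A minimal sketch of saving and reloading such results with the standard `pickle` module. The two lists hold hypothetical placeholder values for two questions with k=3, and `save_pickle` / `load_pickle` are illustrative helper names.

```python
import pickle

def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load a pickled object from path."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical placeholder results: one inner list per question, k entries each.
all_responses = [["text a", "text b", "text c"], ["text d", "text e", "text f"]]
all_numbers = [[4.0, 4.0, None], [10.0, 3.0, 10.0]]

save_pickle(all_responses, "rag_test_responses.pt")
save_pickle(all_numbers, "rag_test_numbers.pt")

# Reloading gives back the same nested lists, so later tasks can skip generation.
print(load_pickle("rag_test_numbers.pt"))
```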

In [None]:
fusion_retriever = FusionRetriever(
   llm, [vector_retriever, bm25_retriever], similarity_top_k=3
)
text_qa_template_str = (
    "You are a mathematical solving companion. Your role is to analyze the following examples and use them to solve a new problem.\n"
    "Previous examples:\n{context_str}\n"
    "Please solve this new problem based on the examples provided:\n"
    "{query_str}"
)
text_qa_template = PromptTemplate(text_qa_template_str)

query_engine = RetrieverQueryEngine.from_args(fusion_retriever, llm=llm, text_qa_template=text_qa_template)

def calculate_accuracy_k_way(test_set, num_items=35):
    correct_answers = 0

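    # head() limits the loop to the first num_items rows regardless of the index values.
    for _, row in test_set.head(num_items).iterrows():
        # Same layout as the earlier test query: newline before and after "Solution:".
        problem_str = 'Problem:\n' + row['problem'] + '\nSolution:\n'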
        solutions, full_text_responses = infer_at_k(problem_str, k=3)

        last_number = most_common_item(solutions)
        boxed_answer = extract_answer_from_boxed(row['solution'])

        print('LLM RESULT:',last_number)
        print('GT:', boxed_answer)

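        try:
            if float(boxed_answer) == float(last_number):
                correct_answers += 1
        except (ValueError, TypeError):
            # ValueError: the ground truth is not a plain number (e.g. a LaTeX fraction).
            # TypeError: no number could be extracted (last_number is None).
            # Either way the item counts as incorrect.
            pass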

    accuracy = correct_answers / num_items
    print(f"Accuracy: {accuracy:.2%}")

    return accuracy

# Run the accuracy calculation
test_set = pd.read_csv("math_dataset_test.csv")
test_accuracy_k_way = calculate_accuracy_k_way(test_set)

100%|██████████| 2/2 [00:00<00:00, 54.85it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     214.04 ms /   407 runs   (    0.53 ms per token,  1901.51 tokens per second)
llama_print_timings: prompt eval time =    8682.33 ms /     2 tokens ( 4341.17 ms per token,     0.23 tokens per second)
llama_print_timings:        eval time =   18167.23 ms /   406 runs   (   44.75 ms per token,    22.35 tokens per second)
llama_print_timings:       total time =   18993.81 ms /   408 tokens
100%|██████████| 2/2 [00:00<00:00, 93.86it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     208.60 ms /   407 runs   (    0.51 ms per token,  1951.10 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18267.19 ms /   

LLM RESULT: 3.0
GT: 10


100%|██████████| 2/2 [00:00<00:00, 99.37it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     252.55 ms /   407 runs   (    0.62 ms per token,  1611.57 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   19477.92 ms /   407 runs   (   47.86 ms per token,    20.90 tokens per second)
llama_print_timings:       total time =   20396.54 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 50.14it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     239.26 ms /   407 runs   (    0.59 ms per token,  1701.08 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18985.26 ms /   

LLM RESULT: 3.0
GT: 4


100%|██████████| 2/2 [00:00<00:00, 45.03it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     227.29 ms /   407 runs   (    0.56 ms per token,  1790.64 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18739.87 ms /   407 runs   (   46.04 ms per token,    21.72 tokens per second)
llama_print_timings:       total time =   19596.78 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 90.39it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     203.92 ms /   407 runs   (    0.50 ms per token,  1995.92 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17975.42 ms /   

LLM RESULT: 3.0
GT: 8


100%|██████████| 2/2 [00:00<00:00, 60.85it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     213.35 ms /   407 runs   (    0.52 ms per token,  1907.70 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18327.97 ms /   407 runs   (   45.03 ms per token,    22.21 tokens per second)
llama_print_timings:       total time =   19092.80 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 79.94it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     215.94 ms /   407 runs   (    0.53 ms per token,  1884.82 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18167.49 ms /   

LLM RESULT: 3.0
GT: 187.5


100%|██████████| 2/2 [00:00<00:00, 83.66it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     228.94 ms /   407 runs   (    0.56 ms per token,  1777.76 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18437.24 ms /   407 runs   (   45.30 ms per token,    22.07 tokens per second)
llama_print_timings:       total time =   19253.27 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 91.91it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     206.37 ms /   407 runs   (    0.51 ms per token,  1972.21 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17968.21 ms /   

LLM RESULT: 3.0
GT: 3125


100%|██████████| 2/2 [00:00<00:00, 60.84it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     204.11 ms /   407 runs   (    0.50 ms per token,  1994.04 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18015.31 ms /   407 runs   (   44.26 ms per token,    22.59 tokens per second)
llama_print_timings:       total time =   18752.27 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 96.72it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     207.93 ms /   407 runs   (    0.51 ms per token,  1957.40 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18059.20 ms /   

LLM RESULT: 3.0
GT: 4


100%|██████████| 2/2 [00:00<00:00, 84.00it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     203.09 ms /   407 runs   (    0.50 ms per token,  2004.07 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17957.33 ms /   407 runs   (   44.12 ms per token,    22.66 tokens per second)
llama_print_timings:       total time =   18684.32 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 100.83it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     211.62 ms /   407 runs   (    0.52 ms per token,  1923.26 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18202.69 ms /  

LLM RESULT: 3.0
GT: 6


100%|██████████| 2/2 [00:00<00:00, 93.21it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     219.20 ms /   407 runs   (    0.54 ms per token,  1856.74 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18167.84 ms /   407 runs   (   44.64 ms per token,    22.40 tokens per second)
llama_print_timings:       total time =   18954.07 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 88.77it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     205.63 ms /   407 runs   (    0.51 ms per token,  1979.31 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17996.13 ms /   

LLM RESULT: 3.0
GT: 8


100%|██████████| 2/2 [00:00<00:00, 92.85it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     202.30 ms /   407 runs   (    0.50 ms per token,  2011.83 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17922.25 ms /   407 runs   (   44.04 ms per token,    22.71 tokens per second)
llama_print_timings:       total time =   18634.80 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 86.18it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     210.66 ms /   407 runs   (    0.52 ms per token,  1931.98 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18081.34 ms /   

LLM RESULT: 3.0
GT: 8


100%|██████████| 2/2 [00:00<00:00, 97.17it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     206.28 ms /   407 runs   (    0.51 ms per token,  1973.02 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17944.09 ms /   407 runs   (   44.09 ms per token,    22.68 tokens per second)
llama_print_timings:       total time =   18673.79 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 50.13it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     215.66 ms /   407 runs   (    0.53 ms per token,  1887.25 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18228.47 ms /   

LLM RESULT: 3.0
GT: 26


100%|██████████| 2/2 [00:00<00:00, 59.45it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     211.68 ms /   407 runs   (    0.52 ms per token,  1922.75 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18080.59 ms /   407 runs   (   44.42 ms per token,    22.51 tokens per second)
llama_print_timings:       total time =   18838.44 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 71.77it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     203.22 ms /   407 runs   (    0.50 ms per token,  2002.77 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17884.08 ms /   

LLM RESULT: 3.0
GT: 1


100%|██████████| 2/2 [00:00<00:00, 100.95it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     204.03 ms /   407 runs   (    0.50 ms per token,  1994.85 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17948.71 ms /   407 runs   (   44.10 ms per token,    22.68 tokens per second)
llama_print_timings:       total time =   18667.79 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 76.93it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     210.92 ms /   407 runs   (    0.52 ms per token,  1929.66 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18115.76 ms /  

LLM RESULT: 3.0
GT: 6


100%|██████████| 2/2 [00:00<00:00, 95.85it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     210.19 ms /   407 runs   (    0.52 ms per token,  1936.33 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17987.35 ms /   407 runs   (   44.19 ms per token,    22.63 tokens per second)
llama_print_timings:       total time =   18728.17 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 52.55it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     207.67 ms /   407 runs   (    0.51 ms per token,  1959.83 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18041.98 ms /   

LLM RESULT: 3.0
GT: 10


100%|██████████| 2/2 [00:00<00:00, 41.39it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     209.84 ms /   407 runs   (    0.52 ms per token,  1939.55 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18019.09 ms /   407 runs   (   44.27 ms per token,    22.59 tokens per second)
llama_print_timings:       total time =   18762.30 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 94.19it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     211.54 ms /   407 runs   (    0.52 ms per token,  1923.98 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17929.69 ms /   

LLM RESULT: 3.0
GT: 15


100%|██████████| 2/2 [00:00<00:00, 81.34it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     203.79 ms /   407 runs   (    0.50 ms per token,  1997.11 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17940.78 ms /   407 runs   (   44.08 ms per token,    22.69 tokens per second)
llama_print_timings:       total time =   18665.17 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 94.11it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     212.19 ms /   407 runs   (    0.52 ms per token,  1918.12 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18081.60 ms /   

LLM RESULT: 3.0
GT: -8


100%|██████████| 2/2 [00:00<00:00, 80.47it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     208.53 ms /   407 runs   (    0.51 ms per token,  1951.74 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18043.17 ms /   407 runs   (   44.33 ms per token,    22.56 tokens per second)
llama_print_timings:       total time =   18801.48 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 47.98it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     211.30 ms /   407 runs   (    0.52 ms per token,  1926.15 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18111.10 ms /   

LLM RESULT: 3.0
GT: 19


100%|██████████| 2/2 [00:00<00:00, 87.60it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     215.11 ms /   407 runs   (    0.53 ms per token,  1892.09 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18232.28 ms /   407 runs   (   44.80 ms per token,    22.32 tokens per second)
llama_print_timings:       total time =   19018.11 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 97.81it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     206.72 ms /   407 runs   (    0.51 ms per token,  1968.88 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17964.86 ms /   

LLM RESULT: 3.0
GT: 0


100%|██████████| 2/2 [00:00<00:00, 92.17it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     202.96 ms /   407 runs   (    0.50 ms per token,  2005.34 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17993.30 ms /   407 runs   (   44.21 ms per token,    22.62 tokens per second)
llama_print_timings:       total time =   18724.63 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 94.30it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     212.50 ms /   407 runs   (    0.52 ms per token,  1915.31 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18202.85 ms /   

LLM RESULT: 3.0
GT: 4


100%|██████████| 2/2 [00:00<00:00, 88.42it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     206.98 ms /   407 runs   (    0.51 ms per token,  1966.34 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18025.18 ms /   407 runs   (   44.29 ms per token,    22.58 tokens per second)
llama_print_timings:       total time =   18776.54 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 61.48it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     206.82 ms /   407 runs   (    0.51 ms per token,  1967.91 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18080.76 ms /   

LLM RESULT: 3.0
GT: 8


100%|██████████| 2/2 [00:00<00:00, 45.08it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     215.25 ms /   407 runs   (    0.53 ms per token,  1890.87 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18145.93 ms /   407 runs   (   44.58 ms per token,    22.43 tokens per second)
llama_print_timings:       total time =   18934.58 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 79.31it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     203.19 ms /   407 runs   (    0.50 ms per token,  2003.08 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   17957.83 ms /   

LLM RESULT: 3.0
GT: 1


100%|██████████| 2/2 [00:00<00:00, 94.75it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     211.54 ms /   407 runs   (    0.52 ms per token,  1924.00 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18261.98 ms /   407 runs   (   44.87 ms per token,    22.29 tokens per second)
llama_print_timings:       total time =   19106.00 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 90.42it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     214.35 ms /   407 runs   (    0.53 ms per token,  1898.77 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18364.92 ms /   

LLM RESULT: 3.0
GT: 73


100%|██████████| 2/2 [00:00<00:00, 49.49it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     216.11 ms /   407 runs   (    0.53 ms per token,  1883.34 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18345.89 ms /   407 runs   (   45.08 ms per token,    22.18 tokens per second)
llama_print_timings:       total time =   19219.90 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 57.66it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     208.33 ms /   407 runs   (    0.51 ms per token,  1953.59 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18091.41 ms /   

LLM RESULT: 3.0
GT: -5


100%|██████████| 2/2 [00:00<00:00, 43.29it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     209.60 ms /   407 runs   (    0.51 ms per token,  1941.78 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18210.33 ms /   407 runs   (   44.74 ms per token,    22.35 tokens per second)
llama_print_timings:       total time =   19041.00 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 96.28it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     209.91 ms /   407 runs   (    0.52 ms per token,  1938.97 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18107.85 ms /   

LLM RESULT: 3.0
GT: 16


100%|██████████| 2/2 [00:00<00:00, 77.93it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     206.84 ms /   407 runs   (    0.51 ms per token,  1967.67 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18087.00 ms /   407 runs   (   44.44 ms per token,    22.50 tokens per second)
llama_print_timings:       total time =   18883.44 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 92.87it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     218.97 ms /   407 runs   (    0.54 ms per token,  1858.67 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18323.03 ms /   

LLM RESULT: 3.0
GT: 400


100%|██████████| 2/2 [00:00<00:00, 71.89it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     217.45 ms /   407 runs   (    0.53 ms per token,  1871.66 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18271.96 ms /   407 runs   (   44.89 ms per token,    22.27 tokens per second)
llama_print_timings:       total time =   19113.91 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 94.75it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     204.99 ms /   407 runs   (    0.50 ms per token,  1985.47 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18082.80 ms /   

LLM RESULT: 3.0
GT: 12


100%|██████████| 2/2 [00:00<00:00, 78.65it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     204.96 ms /   407 runs   (    0.50 ms per token,  1985.79 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18096.00 ms /   407 runs   (   44.46 ms per token,    22.49 tokens per second)
llama_print_timings:       total time =   18903.90 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 91.53it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     220.79 ms /   407 runs   (    0.54 ms per token,  1843.36 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18479.28 ms /   

LLM RESULT: 3.0
GT: 169


100%|██████████| 2/2 [00:00<00:00, 88.78it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     210.29 ms /   407 runs   (    0.52 ms per token,  1935.40 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18165.56 ms /   407 runs   (   44.63 ms per token,    22.41 tokens per second)
llama_print_timings:       total time =   19009.11 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 58.45it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     207.34 ms /   407 runs   (    0.51 ms per token,  1962.99 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18127.06 ms /   

LLM RESULT: 3.0
GT: 36


100%|██████████| 2/2 [00:00<00:00, 45.04it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     209.49 ms /   407 runs   (    0.51 ms per token,  1942.85 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18195.43 ms /   407 runs   (   44.71 ms per token,    22.37 tokens per second)
llama_print_timings:       total time =   19034.02 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 66.16it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     208.25 ms /   407 runs   (    0.51 ms per token,  1954.35 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18088.10 ms /   

LLM RESULT: 3.0
GT: 100


100%|██████████| 2/2 [00:00<00:00, 101.92it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     204.27 ms /   407 runs   (    0.50 ms per token,  1992.49 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18019.94 ms /   407 runs   (   44.28 ms per token,    22.59 tokens per second)
llama_print_timings:       total time =   18801.07 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 94.01it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     215.73 ms /   407 runs   (    0.53 ms per token,  1886.64 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18247.07 ms /  

LLM RESULT: 3.0
GT: 123


100%|██████████| 2/2 [00:00<00:00, 98.12it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     212.93 ms /   407 runs   (    0.52 ms per token,  1911.44 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18202.77 ms /   407 runs   (   44.72 ms per token,    22.36 tokens per second)
llama_print_timings:       total time =   19018.37 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 74.65it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     229.65 ms /   407 runs   (    0.56 ms per token,  1772.30 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   19110.16 ms /   

LLM RESULT: 3.0
GT: 220


100%|██████████| 2/2 [00:00<00:00, 38.69it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     249.44 ms /   407 runs   (    0.61 ms per token,  1631.65 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   19224.21 ms /   407 runs   (   47.23 ms per token,    21.17 tokens per second)
llama_print_timings:       total time =   20215.17 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 29.68it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     249.08 ms /   407 runs   (    0.61 ms per token,  1634.03 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   19244.93 ms /   

LLM RESULT: 3.0
GT: 364


100%|██████████| 2/2 [00:00<00:00, 94.72it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     214.63 ms /   407 runs   (    0.53 ms per token,  1896.26 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18254.66 ms /   407 runs   (   44.85 ms per token,    22.30 tokens per second)
llama_print_timings:       total time =   19089.12 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 92.48it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     203.34 ms /   407 runs   (    0.50 ms per token,  2001.55 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18004.99 ms /   

LLM RESULT: 3.0
GT: 80


100%|██████████| 2/2 [00:00<00:00, 58.89it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     204.14 ms /   407 runs   (    0.50 ms per token,  1993.72 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18027.40 ms /   407 runs   (   44.29 ms per token,    22.58 tokens per second)
llama_print_timings:       total time =   18794.16 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 70.38it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     215.21 ms /   407 runs   (    0.53 ms per token,  1891.18 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18249.90 ms /   

LLM RESULT: 3.0
GT: 7


100%|██████████| 2/2 [00:00<00:00, 90.45it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     216.58 ms /   407 runs   (    0.53 ms per token,  1879.21 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18201.41 ms /   407 runs   (   44.72 ms per token,    22.36 tokens per second)
llama_print_timings:       total time =   19029.71 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 44.72it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     204.04 ms /   407 runs   (    0.50 ms per token,  1994.75 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18015.22 ms /   

LLM RESULT: 3.0
GT: 3


100%|██████████| 2/2 [00:00<00:00, 48.33it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     210.16 ms /   407 runs   (    0.52 ms per token,  1936.57 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18212.20 ms /   407 runs   (   44.75 ms per token,    22.35 tokens per second)
llama_print_timings:       total time =   19032.63 ms /   407 tokens
100%|██████████| 2/2 [00:00<00:00, 81.87it/s]
Llama.generate: prefix-match hit

llama_print_timings:        load time =     375.10 ms
llama_print_timings:      sample time =     206.99 ms /   407 runs   (    0.51 ms per token,  1966.25 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   18070.81 ms /   

LLM RESULT: 3.0
GT: 800
Accuracy: 2.86%


#### Answer the following questions:
- What would be an appropriate temperature value for trying multiple times to come up with an answer for the same question?
- There are questions where the model answers correctly in all of the tries, and others where the model might answer correctly only once among its different tries. What does this phenomenon tell us about the math capabilities of the tested model?
----------------------------------------------------------------------------
Answer:
When generating multiple answers to the same question, a moderate temperature is advantageous because it introduces some variety in the responses. I would propose a value between 0.4 and 0.7: these are still mathematical problems, so we do not want too much randomness.

The variability in correct responses indicates that the model's mathematical reasoning is not fully consistent: it does not reliably apply the same mathematical principles and rules across attempts. It may perform better on problems that come with suitable context and fail as problems become more complicated.
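
One way to make this observation measurable is to compute, per question, the fraction of tries that were correct. The sketch below is only an illustration: `summarize_tries` is a hypothetical helper, and it assumes each try has already been reduced to a float (or None when no number could be extracted).

```python
def summarize_tries(tries, ground_truth, tol=1e-6):
    """Summarize k sampled answers for one question.

    `tries` holds the extracted numbers (None when extraction failed).
    """
    n_correct = sum(
        1 for t in tries if t is not None and abs(t - ground_truth) <= tol
    )
    return {
        # Fraction of all tries (including failed extractions) that were correct.
        "pass_rate": n_correct / len(tries) if tries else 0.0,
        # True when at least one try hit the ground truth.
        "any_correct": n_correct > 0,
    }


stable = summarize_tries([3.0, 3.0, 3.0], ground_truth=3)
unstable = summarize_tries([3.0, 80.0, None], ground_truth=80)
print(stable, unstable)
```

A pass rate of 1.0 suggests the model solves the question reliably, while a low but non-zero pass rate suggests it can reach the answer only occasionally.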

### Multi-Agent Setup with CrewAI

From a single LLM to multiple agents: you may have seen the different ways in which the concept of an agent has been incorporated into LLM workflows. Delegating tasks and personalising Agents can provide significant boosts over a single LLM.

Here you will use the [CrewAI library](https://docs.crewai.com/) to build a multi-agent system that incorporates meta-cognition and error checking.

-----------------------------------------------------------------------------
Take a look at the documentation and examples of how to initialize an [Agent](https://docs.crewai.com/core-concepts/Agents/), a [Task](https://docs.crewai.com/core-concepts/Tasks/), a [Tool](https://docs.crewai.com/core-concepts/Tools/#creating-your-own-tools) and a [Crew](https://docs.crewai.com/core-concepts/Crews/).

These are the only components you need!

In [None]:
### Import Crew AI ###
!pip install 'crewai[tools]' -q
!pip install cohere -q
!pip install anthropic -q
!pip install -U langchain-anthropic
!pip install --upgrade langchain_experimental -q

import os
import numpy as np
from crewai import Crew, Process, Agent, Task
from langchain_community.chat_models import ChatCohere
from crewai_tools import BaseTool

#### Our Goal now is to create the following multi-agent setup:

RAG --> Chat Model Agent --> K-Answers --> Solution Analyzer --> Feedback --> Summary Writer --> Final Answer

Here:
* Solution Analyzer: Agent whose purpose is to look at the current problem and different ways to solve it, and choose the correct one out of them.
If there are no suggested ways to solve the problem, or the agent thinks that none of them is correct, the agent can suggest its own solution.

* Feedback: A string reflecting the response of the Solution Analyzer. It can be a sequence of solution steps, a string saying "All the steps are correct", or anything else the Agent might respond.

* Summary Writer: Agent whose purpose is to look through the Feedback and decide which number should be extracted from it as the solution. It emulates the behaviour of the ```extract_last_floating_number``` function you created above, but in a more context-aware manner.

* Final Answer: A string / float representing the best out of k possible ways of answering the math question according to the agent pipeline.
---------------------------------------------------------------------------
**Each Agent has access only to the output of the previous Agent, and the inputs of the current Task.**

---------------------------------------------------------------------------
##### **Important:** Since we have already performed the steps up to K-Answers (and possibly saved the results), there is no reason to redo them. If that is how you want to proceed, look at the ```retrieve_at_k``` function below and modify it so that it reads the saved pickles instead of calling the ChatModel.


##### If you prefer to use the retrieve_at_k function as is, that is fine; you will just have to wait a bit longer while evaluating the results.

-------------------------------------------------------------------------------
#### Now let's have a look at the following modified FusionRAG K-Way function:
It works as before, but now it drops attempts whose extracted number is None (together with their full worked solutions) and returns a context string describing the model's attempts at solving the problem. If the ChatModel proposed no solution, it asks for help instead.


In [None]:
def retrieve_at_k(argument, k=3):
    message_of_solution = ''
    message_of_no_solution = 'Unfortunately I have no idea how to solve this problem. Can you help me?'
    # Run k attempts; each yields an extracted number and the full worked answer.
    numerical_responses, full_answers = infer_at_k(argument, k=k)

    # Keep only the attempts whose extracted number is not None.
    final_responses = [f for n, f in zip(numerical_responses, full_answers) if n is not None]

    if len(final_responses) == 0:
        return message_of_no_solution
    else:
        for i in range(len(final_responses)):
            message_of_solution += f'\nSolution {i}:\n{final_responses[i]}\n'
    return message_of_solution
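
If the K-Answers were saved earlier, a cached variant could look roughly like the sketch below. It is not the notebook's reference solution: the file name `k_answers.pkl`, the cache layout (a dict mapping a question index to a `(numerical_responses, full_answers)` pair) and the names `format_solutions` and `retrieve_at_k_cached` are all assumptions made for illustration.

```python
import pickle

NO_SOLUTION_MESSAGE = (
    "Unfortunately I have no idea how to solve this problem. Can you help me?"
)


def format_solutions(numerical_responses, full_answers):
    """Drop attempts without an extracted number and format the rest."""
    kept = [f for n, f in zip(numerical_responses, full_answers) if n is not None]
    if not kept:
        return NO_SOLUTION_MESSAGE
    return "".join(f"\nSolution {i}:\n{answer}\n" for i, answer in enumerate(kept))


def retrieve_at_k_cached(question_index, cache_path="k_answers.pkl", k=3):
    """Read saved attempts instead of calling the chat model.

    Assumed cache layout: {question_index: (numerical_responses, full_answers)}.
    """
    with open(cache_path, "rb") as fh:
        cache = pickle.load(fh)
    numerical_responses, full_answers = cache[question_index]
    return format_solutions(numerical_responses[:k], full_answers[:k])
```

Keeping the formatting in its own function means the live and cached paths produce identical context strings for the Solution Analyzer.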

In [None]:
### Test it with a random problem ###


#### Create a ```MathTasks``` class whose methods implement the custom tasks to be run by our agents.

Tasks:
* validation: The task of the Solution Analyzer. It needs to have access to the agent, the current task and the proposed solutions.

* summary: The task of the Summary Writer. It needs access only to the current agent.


In [None]:
from crewai import Task
from textwrap import dedent

class MathTasks():
  def validation(self, ### Fill the arguments
  ):
    return Task(description=dedent(f"""
        Fill this
      """),
      agent=agent,
      expected_output='Fill this'
    )

  def summary(self,  ### Fill the arguments
              ):
    return Task(description=dedent(f"""
        Fill this
      """),
      agent=agent,
      expected_output='Fill this'
    )
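
When filling in the descriptions, note that `dedent` only strips indentation shared by every line, so interpolating a multi-line string (such as the proposed solutions) into an indented f-string before dedenting can leave the template indented. A possible workaround, sketched below with hypothetical template wording and a hypothetical helper `validation_description`, is to dedent the template first and fill it in afterwards; the resulting string would then be passed as `description=` to `Task`.

```python
from textwrap import dedent

# Dedent the static template once, before any multi-line text is inserted.
VALIDATION_TEMPLATE = dedent("""\
    Problem:
    {problem}

    Proposed solutions:
    {solutions}

    Check each proposed solution step by step and select the correct one.
    If none is correct, or none was proposed, solve the problem yourself.
    """)

SUMMARY_TEMPLATE = dedent("""\
    Read the feedback from the previous task and extract the final numeric answer.
    Reply with that number only, without units or explanation.
    """)


def validation_description(problem, solutions):
    """Fill the validation template with the current problem and attempts."""
    return VALIDATION_TEMPLATE.format(problem=problem, solutions=solutions)


desc = validation_description("What is 1+1?", "\nSolution 0:\n1+1 = 2\n")
print(desc)
```

The wording of the two templates is only a starting point; the summary template takes no inputs because the Summary Writer works from the previous agent's output.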

#### Make up your crew using **ONE** of the following:
* OpenAI (I recommend GPT3.5 Turbo)
* Cohere (I recommend command-r-plus)
* Anthropic (I recommend any model)

In [None]:
from langchain_community.chat_models import ChatCohere
from langchain_community.chat_models import ChatOpenAI
from langchain_anthropic import ChatAnthropic

#os.environ["OPENAI_API_KEY"] =
#os.environ["COHERE_API_KEY"] =
#os.environ["ANTHROPIC_API_KEY"] =

#agent_base_llm = ChatCohere(model='command-r-plus', temperature=0.2)
#agent_base_llm = ChatOpenAI(temperature=0.2, model_name="gpt-3.5-turbo-1106")
#agent_base_llm = ChatAnthropic(temperature=0.2, model_name="claude-3-haiku-20240307")

#### **OR** keep the same model behind a different wrapper using LangChain experimental.

-------------------------------------------------------------------------------
**Important:** If you choose this option, you need to release the GPU memory held by the previously loaded model. To do this, select Runtime-->Restart Session, rerun every import, and load this model instead of the previous one.
------------------------------------------------------------------------------

In [None]:
from langchain_experimental.chat_models import Llama2Chat

llm = ChatWrapperLlama(
    model_url=None,
    model_path='mistral-7b-instruct-v0.2.Q4_K_M.gguf',
    temperature=###,
    max_new_tokens=###,
    n_ctx=###,  # This is the context length, but with a different argument name in this wrapper
    n_gpu_layers= 32,
    top_p=0.95
)
llm.verbose=False

agent_base_llm = Llama2Chat(llm=llm)

In [None]:
MAX_RPM_GLOBAL = 100
N_AGENTS = 2

solution_analyzer = Agent(
  role=
  goal=
  backstory=
  llm=agent_base_llm,
  max_rpm=MAX_RPM_GLOBAL // N_AGENTS,
)

summary_writer = Agent(
  role=
  goal=
  backstory=
  llm=agent_base_llm,
  max_rpm=MAX_RPM_GLOBAL // N_AGENTS,
)

In [None]:
def pipeline(current_task, k=3):
    # Define the tasks in sequence
    proposed_solutions = ### FILL THIS and use K here
    analysis_task = MathTasks().validation(### FILL THIS)
    writing_task = MathTasks().summary(### FILL THIS)

    # Form the crew with a sequential process
    report_crew = Crew(### FILL THIS)
    # Execute tasks
    res = report_crew.kickoff()
    return res

### Measure Multi-Agent Performance




In [None]:
### Write code that iterates over the first 35 questions of the test dataset.
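
A minimal evaluation loop might look like the sketch below. It makes several assumptions: `run_pipeline` is whatever callable you built above (for example `pipeline`), ground truth answers are numeric, and `parse_final_number` is a hypothetical regex fallback for reading a number out of the crew's final output.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_final_number(text):
    """Return the last number found in the text as a float, or None."""
    matches = _NUMBER.findall(str(text))
    return float(matches[-1]) if matches else None


def evaluate(run_pipeline, problems, answers, limit=35, tol=1e-6):
    """Run the pipeline on the first `limit` problems and return the accuracy."""
    n_correct = 0
    n_total = 0
    for problem, truth in list(zip(problems, answers))[:limit]:
        n_total += 1
        try:
            predicted = parse_final_number(run_pipeline(problem))
        except Exception as exc:  # one failed crew run should not stop the loop
            print(f"Pipeline failed on a problem: {exc}")
            predicted = None
        if predicted is not None and abs(predicted - float(truth)) <= tol:
            n_correct += 1
    return n_correct / n_total if n_total else 0.0


# Toy check with a stand-in pipeline that always answers 4.
accuracy = evaluate(lambda problem: "The answer is 4", ["2+2", "3+3"], [4, 6])
print(f"Accuracy: {accuracy:.2%}")
```

Counting failed runs as wrong answers keeps the denominator fixed, so the accuracy stays comparable with the single-model experiments.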

### Question: What is the performance difference? Did the validation agent help? Did you see any issues with the summary writer?
-------------------------------------------------------------------------------
Answer:


### Incorporate a Tutoring Mechanism.
Our pipeline will now look like this:

RAG --> Chat Model Agent --> K-Answers --> Solution Analyzer --> Feedback --> Tutor--> Ground Truth Hint --> Solution Analyzer --> Updated Feedback --> Summary Writer --> Final Answer

* Tutor: Agent that, given the current task, the proposed solutions and access to the ground truth answer (from ```'./math_dataset_test.csv'```), provides the Solution Analyzer with hints about the correctness of its answer. Optionally, try to keep the Agent from revealing the correct answer.

* The Solution Analyzer is then engaged again in a new task called reflect, where it reflects on the tutor's hint and decides on a final answer.

------------------------------------------------------------------------------
**Important**: To implement access to the ground truth data, you need to:

Implement a class SoftTutorDB that uses the embedding model you have loaded to encode the ground truth problems in the test CSV (problems only). Given a new problem, it must return the top-1 solution (it is guaranteed to find one). Then use the provided TutorTool below and incorporate it into your pipeline.

This option is not guaranteed to work well if a non-API model has been chosen for the multi-agent pipeline, so you will be graded on the basis of your implementation and not actual performance for this task.


In [None]:
class SoftTutorDB:
    def __init__(self, file='./math_dataset_test.csv', problem_index='problem', solution_index='solution'):
        self.db = pd.read_csv(file)
        self.pi = problem_index
        self.si = solution_index
        self._encode_problems()

    def _encode_problems(self):
        self.keys =
        self.values =

    def get(self, query):



class TutorTool(BaseTool):
    name: str = "Tutoring Tool"
    description: str = "Given a math problem this tool returns the correct solution to it."
    db: object = SoftTutorDB()
    def _run(self, problem: str) -> str:
        return self.db.get(problem)
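
The lookup inside `SoftTutorDB.get` boils down to a top-1 nearest-neighbour search over the encoded problems. The sketch below shows the idea on its own, under loud assumptions: `toy_embed` is a bag-of-words stand-in for the real embedding model, and `TinyTutorDB` is a hypothetical in-memory class, not the `SoftTutorDB` you are asked to write.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for the real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyTutorDB:
    def __init__(self, problems, solutions, embed=toy_embed):
        self.embed = embed
        self.keys = [embed(p) for p in problems]  # encoded problems
        self.values = list(solutions)             # aligned solutions

    def get(self, query):
        """Return the solution whose problem is most similar to the query."""
        q = self.embed(query)
        best = max(range(len(self.keys)), key=lambda i: cosine(q, self.keys[i]))
        return self.values[best]


db = TinyTutorDB(
    ["What is 2 plus 2?", "Find the area of a circle of radius 3."],
    ["4", "9*pi"],
)
print(db.get("area of a circle with radius 3"))
```

With a dense embedding model the keys would be vectors and the similarity a dot product over normalized embeddings, but the top-1 selection stays the same.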

#### Expand your MathTasks class to implement two more tasks: tutoring and reflection.

----------------------------------------------------------------------------
What items does each task need access to? Remember that each Agent has access only to the output of the previous Agent, and the input of the Task.


In [None]:
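class MathTasks():
    def validation(self):  ### Fill the arguments and the body
        ...

    def summary(self):  ### Fill the arguments and the body
        ...

    def tutoring(self):  ### Fill the arguments and the body
        ...

    def reflect(self):  ### Fill the arguments and the body
        ...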

In [None]:
analyst = Agent(
    role=
    goal=
    backstory=,
    tools = [],
    llm=agent_base_llm,
    max_rpm=1,
)

tutor = Agent(
  role=
  goal=
  backstory=,
  tools = [TutorTool()],
  llm=agent_base_llm,
  max_rpm=1,
)

writer = Agent(
    role=
    goal=
    backstory=,
    tools = [],
    llm=agent_base_llm,
    max_rpm=1,
)

In [None]:
def pipeline(current_task, k=3):
    # Define the tasks in sequence
    proposed_solutions = ### FILL THIS and use K here
    analysis_task = MathTasks().validation(### FILL THIS)
    writing_task = MathTasks().summary(### FILL THIS)

    # Form the crew with a sequential process
    report_crew = Crew(### FILL THIS)
    # Execute tasks
    res = report_crew.kickoff()
    return res

In [None]:
def pipeline_with_tutoring(current_task, k):
    # Define the tasks in sequence
    proposed_solutions = ### FILL THIS and use K here
    analysis_task = MathTasks().validation(### FILL THIS)
    tutoring_task = MathTasks().tutoring(### FILL THIS)
    reflection_task = MathTasks().reflect(### FILL THIS, Hint: Use context=[analysis_task, tutoring_task] as an extra argument)
    writing_task = MathTasks().summary(### FILL THIS)

    # Form the crew with a sequential process
    report_crew = Crew()
    # Execute tasks
    res = report_crew.kickoff()
    return res

In [None]:
### Write code that iterates over the first 35 questions of the test dataset.