# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

In this section, we install the necessary python packages and download model weights of the quantized version of LLaMA 3.1 8B. Also, download the dataset. Note that the model weight is around 8GB. If you are using your Google Drive as the working directory, make sure you have enough space for the model.

In [73]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122


In [74]:
import torch
if not torch.cuda.is_available():
    print('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!') #because i use mak - -
else:
    print('You are good to go!')

You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!


using macbook ... :)

In [75]:
import torch
if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)
    print (x)
else:
    print ("MPS device not found.")

tensor([1.], device='mps:0')


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. you can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLM models.

In the following code block, we will load the downloaded LLM model weights onto the GPU first.
Then, we implemented the generate_response() function so that you can get the generated response from the LLM model more easily.

You can ignore "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [76]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

def generate_response(_model: Llama, _messages: str) -> str:
    '''
    This function will inference the model with given messages.
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate, you can change it and observe the differences.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_a

## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [77]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except:
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    This function will search the keyword and return the text content in the first n_results web pages.

    Warning: You may suffer from HTTP 429 errors if you search too many times in a period of time. This is unavoidable and you should take your own risk if you want to try search more results at once.
    The rate limit is not explicitly announced by Google, hence there's not much we can do except for changing the IP or wait until Google unban you (we don't know how long the penalty will last either).
    '''
    keyword = keyword[:90]
    keyword = f"{keyword} \n -site:wikipedia.org "
    # First, search the keyword and get the results. Also, get 2 times more results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh-TW", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]

## Test the LLM inference pipeline

In [78]:
# You can try out different questions here.
test_question='請問誰是 Taylor Swift？'

messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

泰勒絲（Taylor Swift）是一位美國歌手、詞曲作家和音樂製作人。她出生於1989年，來自田納西州。她的音乐风格从乡村摇滚发展到流行搖擺，並且她被誉为当代最成功的女艺人的之一。

泰勒絲早期在鄉郊小鎮演唱會時開始發展音樂事業，她推出了多張專輯，包括《Taylor Swift》、《Fearless》，以及後來更為知名的大熱作品——「1989」。她以其寫作歌詞的能力和對情感表達深入人心而聞name。

泰勒絲獲得了許多少項獎座，如格萊美、美國唱片業協會（RIAA）等，並且她的專輯銷量超過1億張。


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.

In [79]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role means who this agent should act like. e.g. the history expert, the manager......
        self.task_description = task_description    # Task description instructs what task should this agent solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messsages first.
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"{self.task_description}\n{message}"}, # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper seperation text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""

TODO: Design the role description and task description for each agent.

In [80]:
# TODO: Design the role and task description for each agent.

# This agent may help you filter out the irrelevant parts in question descriptions.
question_extraction_agent = LLMAgent(
    role_description="""請根據以下要求，將輸入的文字萃取出等價的問題：
                        - 目標：把輸入的問題改成意義相同，但只用一句話的句子
                        - 警告：不允許擅自添加任何詞語，也不允許你回答你認為的答案，例如："最近報導中，常提及台灣大學的美食眾多，請問在台灣大學的麥當勞最好吃的是什麼？"，若回答"最近在台灣大學的麥當勞最好吃的是什麼？"則是正確的，但若回答"台灣大學的麥當勞最好吃的是什麼雞塊？"，"台灣大學的麥當勞最好吃的是什麼薯條"，"台灣大學的麥當勞最好吃的薯條問題是？"，"問題：台灣大學的麥當勞最好吃的薯條是？"，"最近報導中，常提及台灣大學的...眾多，請問...麥當勞最好吃的是什麼？"，等等此類回覆皆為錯誤，因為不允許擅自添加詞語，請直接利用文字中的句子即可
                        - 特殊要求：如果問題有特別的輸出格式，請保留該段話，例如："最近報導中，常提及台灣大學的美食眾多，請問在台灣大學的麥當勞最好吃的是什麼？請回答漢堡與對應套餐"，若回答"最近在台灣大學的麥當勞最好吃的是什麼？請回答漢堡與對應套餐"則是正確的
                        - 結構：保留輸入的問題裡面，靠近問號的地方詢問的問題，以及整體文字中所代表的關鍵字，以及詢問的主題（例如：學校或地名）
                        - 範例：“家庭教師動漫中，有如獵人和火影忍者一樣刺激的打鬥畫面，請問家庭教師中專著黑色西裝，並且功力高深的人是誰？” -> 這段話就要輸出 ”家庭教師中專著黑色西裝，並且功力高深的人是誰？“
                        - 輸出限制：使用中文時只會使用繁體中文來回問題，不准用簡體中文，非常重要，並且你只會使用原本問題裡的句子回答，如果是英文專有名詞，可以附上。""",
    task_description="找出以下段落所問的問題，用一句話呈現：",
)

# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
keyword_extraction_agent = LLMAgent(
    role_description="""請根據以下要求，將輸入的文字萃取出等價的問題關鍵字：
                        - 目標：把輸入的問題改成意義相同，但只用30個字以內呈現的句子，並且皆為問題關鍵字，請特別留意問題時間
                        - 特殊要求：如果問題有特別討論的時間，請保留該段話，例如："最新的iphone手機出到第幾代？"，這段話就要輸出 "最新iphone手機 第幾代"，如果回答 "iphone手機 第幾代"那麼會是不正確的
                        - 警告：不允許擅自添加任何詞語，也不允許你回答你認為的答案，例如："最近報導中，常提及台灣大學的美食眾多，請問在台灣大學的麥當勞最好吃的是什麼？"，若回答"最近在台灣大學的麥當勞最好吃的是什麼？"則是正確的，但若回答"台灣大學的麥當勞最好吃的是什麼雞塊？"，"台灣大學的麥當勞最好吃的是什麼薯條"，"台灣大學的麥當勞最好吃的薯條問題是？"，"問題：台灣大學的麥當勞最好吃的薯條是？"，等等此類回覆皆為錯誤，因為不允許擅自添加詞語，請直接利用文字中的句子即可
                        - 結構：保留輸入的問題裡面的結構，特別是靠近問號的地方詢問的問題，以及是整體問題強調的詢問議題（例如「」內的文字很特殊，請保留），如果是很簡單的問題請直接輸出
                        - 範例：“家庭教師動漫中，有如獵人和火影忍者一樣刺激的打鬥畫面，請問家庭教師中專著黑色西裝，並且功力高深的人是誰？” -> 這段話就要輸出 ”家庭教師動漫 黑色西裝 功力高深的人“
                        - 範例：“台大電機系張耀文教授在2025年報導中指出，台灣大學某課程停修率上升，與「108課綱」相關，請問該課程為何？” -> 這段話就要輸出 ”台大電機系張耀文 台大停修率上升 課程“
                        - 範例：“請問200*909=?” -> 這段話就要輸出 ”請問200*909=?“，因為他是很短的問題，並且請不要任意添加不屬於原本問題的詞語
                        - 特殊要求：最多只能用30個字回答，總共數個關鍵字，以及特定年份
                        - 輸出限制：請注意，不允許擅自添加並非屬於原本文字的詞語，使用中文時只會使用繁體中文來回問題，不准用簡體中文，非常重要，並且你只會使用原本問題裡的句子回答，如果是英文專有名詞，可以附上。""",
    task_description="找出以下段落要問的問題，用30個字以內呈現：",
)

# This agent is the core component that answers the question.
qa_agent = LLMAgent(
    role_description="你是一位只會用五個字以內回答問題的專家，當你看完問題後，你會看到相關資料，看完之後根據對應的問題回答出足夠好的答案，你只需要講出答案，不需要做過多的修飾或解釋，能用一個名詞來符合回答問題是最好的。使用中文時只會使用繁體中文來回問題，可以用數字回答，但有單位要加上單位，專有名詞可以用英文。",
    task_description="用最精簡的方式回答問題，比方說一個名詞而已，請記住使用中文時只會使用繁體中文來回問題，不准用簡體中文，殘體中文輸出答案，以下為問題與資料：",
)

## RAG pipeline

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there might be more heuristics (e.g. classifying the questions based on their lengths, determining if the question need a search or not, reconfirm the answer before returning it to the user......) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)

In [81]:
async def pipeline(question: str) -> str:
    # TODO: Implement your pipeline.
    # Currently, it only feeds the question directly to the LLM.
    # You may want to get the final results through multiple inferences.
    # Just a quick reminder, make sure your input length is within the limit of the model context window (16384 tokens), you may want to truncate some excessive texts.
    key = keyword_extraction_agent.inference(question)
    q_ext =  question_extraction_agent.inference(question)
    #print(key)
    #print(q_ext)
    results  =  await search(key)
    #print(results)
    question = f"以下為問題摘要：{q_ext}\n以下為資料：{results}"
    question = question[:15000]
    return qa_agent.inference(question)

In [82]:
# You can try out different questions here.
#question='Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data論文中提出的模型是甚麼名字？'
#answer = await pipeline(question)
#answer = answer.replace('\n',' ')
#print(answer)

## Answer the questions using your pipeline!

Since Colab has usage limit, you might encounter the disconnections. The following code will save your answer for each question. If you have mounted your Google Drive as instructed, you can just rerun the whole notebook to continue your process.

In [83]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "B12901140"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

with open('./private.txt', 'r') as input_f:
    questions = input_f.readlines()
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'a') as output_f:
            print(answer, file=output_f)

1 光華國小
2 750
3 史蒂夫·賈伯斯
4 TOEFL iBT
5 觸地得5分
6 卑南族的祖先發源自巴布亞新幾內雅
7 李琳山


CancelledError: 

In [None]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)