# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, we will mount your own Google Drive and change the working directory.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Change the working directory to somewhere in your Google Drive.
# You could check the path by right clicking on the folder.
%ls
%cd /content/drive/MyDrive/colabhw/

drive/  sample_data/
/content/drive/MyDrive/colabhw


In this section, we install the necessary Python packages and download the weights of the quantized LLaMA 3.1 8B model, along with the dataset. Note that the model weights take up around 8 GB. If you use your Google Drive as the working directory, make sure it has enough free space for the model.

In [5]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl (445.2 MB)
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metad

In [6]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLMs.

In the following code block, we first load the downloaded LLM weights onto the GPU.
Then, we implement the generate_response() function so that you can get the generated response from the LLM more easily.

You can ignore the "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [7]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

def generate_response(_model: Llama, _messages: list[dict[str, str]]) -> str:
    '''
    Run chat-completion inference on the model with the given messages.
    Each message is a dict with "role" and "content" keys.
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate, you can change it and observe the differences.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [8]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
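    except Exception:
        # Treat any network error or timeout as "no page" so one bad URL does not abort the batch.
        return None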

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    This function searches for the keyword and returns the text content of the first n_results web pages.

    Warning: You may get HTTP 429 errors if you search too many times in a short period. This is unavoidable, and you request more results at once at your own risk.
    Google does not publish the rate limit, so there is not much we can do except change the IP address or wait until Google lifts the ban (we don't know how long the penalty lasts either).
    '''
    keyword = keyword[:100]
    # First, search the keyword and get the results. Also, get 2 times more results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]
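
The docstring of `search` warns about HTTP 429 errors when searching too often. One common mitigation is to retry with exponential backoff. The sketch below is a generic helper and not part of the TA's tool: `retry_async`, `attempts`, and `base_delay` are hypothetical names, and it assumes the rate-limit failure surfaces as an exception raised by the awaited call. Backing off will not help if the ban lasts longer than the total waiting time.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def retry_async(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 5.0,
) -> T:
    """Await make_call(), retrying with exponential backoff on any exception.

    Hypothetical helper: waits base_delay, then 2 * base_delay, then
    4 * base_delay, ... seconds between attempts.
    """
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of retries: surface the last error to the caller.
            await asyncio.sleep(base_delay * (2 ** attempt))
    # Only reached when the loop body never ran.
    raise ValueError("attempts must be at least 1")
```

In the notebook this might be called as `results = await retry_async(lambda: search("keyword", n_results=3))`, so that each retry issues a fresh search.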

In [11]:
# Test code: try out the search function.
results = await search("牛顿 万有引力定律", n_results=3)
for i, txt in enumerate(results, 1):
    print(f"--- 结果 {i} ---")
    print(txt[:120], "...\n")

--- 结果 1 ---
牛顿万有引力定律-维基百科，自由的百科全书跳转到内容主菜单主菜单移至侧栏隐藏导航首页分类索引特色内容新闻动态最近更改随机条目帮助帮助维基社群方针与指引互助客栈知识问答字词转换IRC即时聊天联络我们关于维基百科特殊页面搜索搜索外观资助维基百科 ...

--- 结果 2 ---
3.3:牛顿万有引力定律-GlobalSkiptomaincontentTogglesTableofContentsMenumenusearchSearchbuild_circleToolsfact_checkHomeworkcancelE ...

--- 结果 3 ---
视频:牛顿万有引力定律JoVEJoVE科研行为学生物化学生物学生物工程癌症研究化学发育生物学工程学环境科学遗传学免疫与感染医学神经科学JoVE杂志JoVE实验百科全书JoVEChromeExtension教育生物学化学Clinical工程学 ...



## Test the LLM inference pipeline

In [21]:
# You can try out different questions here.
test_question='who is Taylor Swift？'

messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文和日语两种语言來回答問題，生成这两种语言的答案。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

**中文答案：**

泰勒·斯威夫特（Taylor Swift）是一位美国歌手、词曲作家和音乐制作人。她以她的情感丰富的流行摇滚乐风格而闻名。出生于1989年，Swift从小就对演唱会有着深厚的情怀，她在10岁时开始写自己的第一首单独创意歌谣，并且她也曾经是美国乡村音乐电视台（CMT）的一位常客。

斯威夫特的职业生涯始于2005年，当她的第一个专辑《泰勒·史维芙》发行。然而，她真正走红是在2010年的第二张專輯「Fearless」之后，該張唱片獲得了多項獎项，並且她也因此成为第一位在20岁之前获得两座格莱美奖的歌手。

**日语答案：**

テイラー・スウィフト（Taylor Swift）は、アメリカのシンガーソングライター、作曲家そして音楽プロデュースャーである。彼女は感情豊かなポップロックサウンドで知られており、そのキャリアでは数多くのヒットを生み出している。

1989年に誕生物まれ後すぐに歌手としての道へ進むとともに関西地方などでも活動し、10歳から最初の一曲を作り始めていた。彼女はCMT（カントリー・ミュージックテレビジョン）で活躍することもあった。そのキャリアを通して数多くの賞を受けている。

2010年にリードシングル「白い狼」がヒットし、デビューから5年目にして初めてのアルバム『フレイヤー』は大きな成功をおさめ、その後も彼女の人気と影響力だけではなく音楽的にも成長を続けており、今では世界中で有名なアーティストの一人となっている。


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.

In [56]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role means who this agent should act like. e.g. the history expert, the manager......
        self.task_description = task_description    # Task description instructs what task should this agent solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messages first.
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"{self.task_description}\n{message}"}, # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper separation text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""
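
The hint in `inference` recommends a proper separation text between the task description and the user message instead of a bare line break. A minimal sketch of one way to do that follows. `build_messages` and the `[Task]` / `[Message]` labels are assumptions made for illustration, not part of the provided class; any unambiguous delimiter would serve.

```python
from typing import Dict, List

def build_messages(
    role_description: str,
    task_description: str,
    message: str,
) -> List[Dict[str, str]]:
    """Build a system/user message pair with labelled sections.

    Hypothetical helper: the labels make it explicit to the model where
    the instructions end and where the user's text begins.
    """
    user_content = (
        "[Task]\n"
        f"{task_description.strip()}\n\n"
        "[Message]\n"
        f"{message.strip()}"
    )
    return [
        {"role": "system", "content": role_description.strip()},
        {"role": "user", "content": user_content},
    ]
```

The returned list has the same shape as the `messages` list built inside `inference`, so it could be passed to `generate_response` unchanged.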

TODO: Design the role description and task description for each agent.

In [61]:
# TODO: Design the role and task description for each agent.

# This agent may help you filter out the irrelevant parts in question descriptions.
question_extraction_agent = LLMAgent(
    role_description="你是一个专业的问题分析专家，擅长从复杂问题中提取核心内容。",
    task_description="请分析以下问题，提取核心问题内容，保持问题的核心含义。如果问题已经简洁明了，请直接返回原问题。",
)

# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
keyword_extraction_agent = LLMAgent(
    role_description="你是一个专业的关键词提取专家，擅长从问题中识别最重要的搜索关键词。",
    task_description="请从以下问题中提取3-5个最重要的搜索关键词，用空格分隔。关键词应该能准确反映问题的核心内容，便于网络搜索找到相关信息。",
)


# This agent is the core component that answers the question.
qa_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。",
    task_description="请根据你的知识库和提供的相关信息，全面、准确地回答以下问题。如果相关信息与问题相关，请充分利用；如果信息不足，请基于你的知识进行回答。直接输出最终答案的内容。不用输出中间的分析过程和不相关的信息。",
)

answer_verification_agent = LLMAgent(
    role_description="你是一个专业的答案检查专家。你的任务是对于被提出的问题，确保AI生成的答案是准确的,符合逻辑的。",
    #task_description="请仔细检查以下答案是否正确，如果有错误请修正答案，并直接输出最终修正答案的内容，如果没有错误，直接输出最终答案的内容。",
    task_description="请检查核对答案是否正确的符合逻辑的回答了被提出的问题，如果有错误请修正答案，并直接输出最终修正答案的内容，如果没有错误，直接输出最终答案的内容。不用输出中间的分析过程和不相关的信息。"
)

## RAG pipeline

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there might be more heuristics (e.g. classifying the questions based on their lengths, determining whether a question needs a search, reconfirming the answer before returning it to the user...) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)

In [72]:
async def pipeline(question: str) -> str:
    # TODO: Implement your pipeline.
    # Currently, it only feeds the question directly to the LLM.
    # You may want to get the final results through multiple inferences.
    # Just a quick reminder, make sure your input length is within the limit of the model context window (16384 tokens), you may want to truncate some excessive texts.

    # Step 1: Clean up the question.
    cleaned_question = question_extraction_agent.inference(question)
    #print(cleaned_question)
    # Step 2: Extract search keywords.
    keywords = keyword_extraction_agent.inference(cleaned_question)
    #print(keywords)

    # Step 3: Search the web.
    # Keep the number of search results small and truncate the content to fit within the context window.
    search_results = await search(keywords, n_results=3)
    # Truncate each search result to a reasonable length, e.g., 2000 characters.
    truncated_search_results = [result[:2000] for result in search_results]
    enhanced_question = f"问题：{cleaned_question}\n\n相关信息：{' '.join(truncated_search_results)}"
    #print(enhanced_question)
    # Step 4: Generate the initial answer.
    initial_answer = qa_agent.inference(enhanced_question)
    #print(initial_answer)

    # Step 5: Verify the answer (currently disabled).
    #final_answer = answer_verification_agent.inference(initial_answer);
    #print(final_answer)

    #return final_answer
    return initial_answer
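
The pipeline above caps each search result at 2000 characters. If `n_results` is raised, a cap on the combined length also helps keep the prompt inside the 16384-token context window. The sketch below is one possible way to do this. `build_context`, `total_chars`, and `per_result_chars` are hypothetical names, and character count is only a rough proxy for token count, so the budget should be chosen conservatively.

```python
from typing import List

def build_context(
    results: List[str],
    total_chars: int = 6000,
    per_result_chars: int = 2000,
) -> str:
    """Join search results, capping each result and the overall length.

    Hypothetical helper: each kept snippet is prefixed with its position
    in the result list so the model can tell the sources apart.
    """
    parts: List[str] = []
    remaining = total_chars
    for index, text in enumerate(results, 1):
        if remaining <= 0:
            break  # Budget used up: drop the remaining results.
        snippet = text[:min(per_result_chars, remaining)]
        if not snippet:
            continue  # Skip empty pages.
        parts.append(f"[{index}] {snippet}")
        remaining -= len(snippet)
    return "\n".join(parts)
```

The result could replace the `' '.join(truncated_search_results)` expression when building the enhanced question.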

In [None]:
# Test the agent pipeline on a single question.
test_question1 = "校歌為學校（包括小學、中學、大學等）宣告或者規定的代表該校的歌曲。用於體現該校的治學理念、辦學理想等學校文化。「虎山雄風飛揚」是哪間學校的校歌歌詞？,光華國小"
answer1 = await pipeline(test_question1)
answer1 = answer1.replace('\n',' ')
print(answer1)

## Answer the questions using your pipeline!

Since Colab has usage limits, you might get disconnected. The following code saves your answer to each question in a separate file. If you have mounted your Google Drive as instructed, you can simply rerun the whole notebook to resume where you left off.

In [75]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "li4agent_answer22"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

with open('./private.txt', 'r') as input_f:
    questions = input_f.readlines()
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'a') as output_f:
            print(answer, file=output_f)

1 "虎山雄風飛揚 "是台北市立中學的校歌。
2 我無法提供相關的資訊。
3 第一代 iPhone 是由史蒂夫·乔布斯（Steve Jobs）發表。
4 根據提供的資訊，托福網路測驗（TOEFL iBT）總分為120 分。考生需要達到一定得數才能申請進階英文免修。  一般而言，大多数学校对 TOFEL 考试成绩有以下要求：  *   C1 级别：90-110     *  这意味着你能理解复杂的学术文章，包括大量信息和困难词汇。              在听力方面，你可以了解在大学环境中进行的一般性对话或演讲，并且能够识出主要观点、细节以及说话者的态度。  *   B2 级别：70-89     *  这意味着你能理解大多数学术文章的主旨和重要信息，但可能会遇到一些困难。              在听力方面，你可以了解在大学环境中进行的一般性对话或演讲，并且能够识出主要观点、细节以及说话者的态度。  *   B1 级别：50-69     *  这意味着你能理解简单的学术文章，但可能会遇到一些困难。              在听力方面，你可以了解在大学环境中进行的一般性对话或演讲，并且能够识出主要观点、细节以及说话者的态度。  *   A2 级别：30-49     *  这意味着你可能会遇到一些困难。              在听力方面，你可以了解在大学环境中进行的一般性对话或演讲，并且能够识出主要观点、细节以及说话者的态度。  *   A1 级别：0-29     *  这意味着你可能会遇到一些困难。              在听力方面，你可以了解在大学环境中进行的一般性对话或演讲，并且能够识出主要观点、细节以及说话者的态度。  因此，考生需要達到的托福網路測驗（TOEFL iBT）總分為90-110才能申請進階英文免修。
5 在Rugby Union中，觸地try可得7分。
6 我找不到相關的信息。
7 熊仔的碩士指導教授為李琳山。
8 詹姆斯·克拉ーク・マ克斯韦尔。
9 距離國立臺灣史前文化博物館最近的臺鐵車站為康樂駅。
10 20+30=50
11 洛杉矶湖人队
12 由于相关信息中没有提到2024年美国总统大选的胜出者，因此无法提供准确答案。
13 Llama3.2系列中，參數量最小的模型是1B参数。
14 依據國立臺灣大學學則，沒有報告書的情

In [None]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)
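
The combining cell above opens every per-question file in turn, so it stops with a `FileNotFoundError` if any answer is missing (for example after a disconnection). A more forgiving variant is sketched below. `combine_answers` and its parameters are hypothetical, and writing an empty line for a missing answer is an assumption about what the submission format tolerates.

```python
from pathlib import Path
from typing import List

def combine_answers(prefix: str, n_questions: int, directory: str = ".") -> List[int]:
    """Write <prefix>.txt with one answer per line and return the missing ids.

    Hypothetical helper: reads <prefix>_<id>.txt for ids 1..n_questions,
    keeps only the first line of each file, and writes an empty line for
    any file that does not exist.
    """
    base = Path(directory)
    missing: List[int] = []
    lines: List[str] = []
    for qid in range(1, n_questions + 1):
        path = base / f"{prefix}_{qid}.txt"
        if path.exists():
            text = path.read_text(encoding="utf-8")
            lines.append(text.splitlines()[0].strip() if text.strip() else "")
        else:
            missing.append(qid)
            lines.append("")
    (base / f"{prefix}.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return missing
```

For instance, `missing = combine_answers(STUDENT_ID, 90)` would write the combined file and report which questions still need to be answered.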

In [81]:
# Baseline test: measure performance with a single QA agent and no search.
qa_agent_test = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。",
    task_description="請回答以下問題：",
)

async def pipeline_test1(question: str) -> str:
    # Baseline for comparison: send the question straight to a single QA agent,
    # with no question cleaning, keyword extraction, search, or verification.
    initial_answer = qa_agent_test.inference(question)
    #print(initial_answer)
    return initial_answer


# Fill in your student ID first.
STUDENT_ID_test = "li4agent_answer55"

STUDENT_ID_test = STUDENT_ID_test.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID_test}_{id}.txt").exists():
            continue
        answer = await pipeline_test1(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID_test}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

with open('./private.txt', 'r') as input_f:
    questions = input_f.readlines()
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID_test}_{id}.txt").exists():
            continue
        answer = await pipeline_test1(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID_test}_{id}.txt', 'a') as output_f:
            print(answer, file=output_f)


1 "虎山雄風飛揚 " 是國立臺灣師範大學的校歌。
2 對不起，我無法提供關於2025年初NCC規定的具體數字。
3 第一代 iPhone 是由史蒂夫·乔布斯（Steve Jobs）發表的。
4 根據台灣大學的進階英文免修申請規定，托福網路測驗 TOEFL iBT 的成績需達到 80 分以上才能符合資格。
5 在橄欖球聯盟（Rugby Union）中，觸地試踢得 5 分。
6 根據歷史記載，卑南族的祖先發源自ruvuwa'an，這個地點被認為是位於現今臺灣省臺東縣太麻里鄉附近。
7 根據我的知識，熊仔的碩士指導教授是黃韻玲。
8 答案：詹姆斯·克拉ーク・マ克斯韋爾。
9 距離國立臺灣史前文化博物館最近的臺鐵車站是台東火车駅。
10 答案是：50
11 對不起，我無法提供關於未來的 NBA 賽事資訊，因為我只知道到 2023 年底。
12 對不起，我無法提供關於未來事件的確切信息，特別是政治選舉結果。2024年美國總統大选尚在進行中，並且各個候任人士仍然競爭當時。我可以幫助你了解更多有关于美国总统竞争和历史上相关事項嗎？
13 根據我的知識，Llama-3.1 系列模型中參數量最小的為 LLaMA 7B（約有七十億個参数），但我無法確認是否能找到更低版本。
14 根據國立臺灣大學學則，停修有更嚴格的限制。依照規定，在沒有師生雙方同意的情況下，每個班級每年最多只能申請兩門課程停止上课（即「退選」），但如果是因為個人原因需要暫時休息或轉學等情形，則可以向學校提出停修申请。
15 對不起，我無法找到任何有關DeepSeek公司的相關資訊。
16 對不起，我無法提供最新的資訊或未來事件。2024年NBA總冠軍隊伍尚待確認，請關注官方新聞和賽事結果以獲得最準確且更新至今的情況報告
17 這類化合物稱為烯（Alkene）。
18 答案是阿倫·圖靈（Alan Turing）。他是一位英國數學家、計算機科學院院士和密碼破譯者，被認為是在20世紀中期對現代電腦理論做出重大貢獻的人。他的「图灵机」概念奠定了计算理论的基礎，並且他提出的圖靈測試也成為人工智能研究中的重要指標之一。他在二戰期間參與密碼破譯工作，對於德國的情报機構Enigma進行分析和攻擊做出了重大貢獻。
19 根據我的知識，臺灣玄天上帝信仰的進香中心位於新北市瑞芳區。
20 Windows 作業系統是微軟公司（Microso

In [None]:
# Combine the results into one file.
with open(f'./{STUDENT_ID_test}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID_test}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)