# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, we will mount your own Google Drive and change the working directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Change the working directory to somewhere in your Google Drive.
# You could check the path by right clicking on the folder.
%cd /content/drive/MyDrive/ML/hw1


/content/drive/MyDrive/ML/hw1


In this section, we install the necessary Python packages and download the weights of the quantized LLaMA 3.1 8B model, along with the dataset. Note that the model weights are around 8 GB. If you are using your Google Drive as the working directory, make sure you have enough free space for the model.

In [None]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl (445.2 MB)
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.me

In [None]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLMs.

In the following code block, we first load the downloaded model weights onto the GPU.
Then, we implement the generate_response() function so that you can get the generated response from the model more easily.

You can ignore "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [None]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

def generate_response(_model: Llama, _messages: list[dict]) -> str:
    '''
    Run inference on the model with the given chat messages (a list of {"role", "content"} dicts).
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate, you can change it and observe the differences.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [None]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except Exception:  # Treat timeouts and network errors as "no page" instead of failing the whole search.
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    This function will search the keyword and return the text content in the first n_results web pages.

    Warning: You may get HTTP 429 errors if you search too many times within a short period. This is unavoidable, and requesting more results at once is at your own risk.
    Google does not publish the rate limit, so there is little we can do other than change the IP address or wait until Google lifts the ban (we do not know how long the penalty lasts either).
    '''
    keyword = keyword[:100]
    # First, search the keyword and get the results. Also, get 2 times more results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]
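
The docstring above warns about HTTP 429 errors. One common way to soften them is to retry with exponential backoff. The following is a minimal sketch, not part of the TA's tool: `retry_with_backoff` and its parameters are hypothetical names, and because the exception type raised on a 429 is not confirmed here, the sketch retries on any `Exception`.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await `call()`, retrying with exponential backoff when it raises."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # The delay doubles on each attempt; random jitter spreads out retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

In the notebook, this could wrap the search call as `await retry_with_backoff(lambda: search(keywords))`. Backoff only helps with short rate-limit windows; it will not lift a longer ban.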

## Test the LLM inference pipeline

In [None]:
# You can try out different questions here.
test_question='請問誰是 Taylor Swift？'

messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

泰勒絲（Taylor Swift）是一位美國歌手、詞曲作家和音樂製作人。她出生於1989年，來自田納西州。她的音乐风格从乡村摇滚发展到流行搖擺，並且她被誉为当代最成功的女艺人的之一。

泰勒絲早期在鄉郊小鎮演唱會時開始發展音樂事業，她推出了多張專輯，包括《Taylor Swift》、《Fearless》，以及後來更為知名的大熱作如 《1989》（2014年）、_reputation（）和 _Lover （）。她的歌曲經常探討愛情、友誼及自我成長等主題。

泰勒絲獲得了許多獎項，包括13座格萊美奖，並且是史上最快達到百萬銷量的女藝人之一。


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.

In [None]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role means who this agent should act like. e.g. the history expert, the manager......
        self.task_description = task_description    # Task description instructs what task should this agent solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messages first.
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"{self.task_description}\n{message}"}, # Hint: you may want the agents to clearly distinguish the task descriptions from the user messages. A proper separator text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""
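
The hint in the class above recommends a clearer separator between the task description and the user message than a single line break. The sketch below shows one possible layout. `build_messages` and the `### Task` / `### Input` labels are assumptions made for illustration, not the TA's format.

```python
from typing import Dict, List

def build_messages(role_description: str, task_description: str, message: str) -> List[Dict[str, str]]:
    """Build chat messages with labelled sections instead of a bare line break."""
    user_content = (
        "### Task\n"
        f"{task_description.strip()}\n\n"
        "### Input\n"
        f"{message.strip()}"
    )
    return [
        {"role": "system", "content": role_description.strip()},
        {"role": "user", "content": user_content},
    ]
```

The returned list has the same shape as the `messages` list built in `inference`, so it could be passed to `generate_response` unchanged. Whether labelled sections actually help this model is something to check empirically.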

TODO: Design the role description and task description for each agent.

In [None]:
# TODO: Design the role and task description for each agent.

# This agent may help you filter out the irrelevant parts in question descriptions.
question_extraction_agent = LLMAgent(
    role_description="你是 LLaMa-3.1-8B，是用來擷取問題的 AI。使用中文時只會使用繁體中文來回問題。你要將收到的問題中不重要的資訊去除，你要確保與原本問題相關的條件和資訊都沒有喪失。你不要根據你自己的知識回答問題或是加入任何資訊。你的回答應該是一個可以和原本問題得到相同答案的簡潔問句。例如，如果原本的問題是「細胞是構成生物的基本單位，關於細胞的研究已經為人類在生命科學領域帶來了極大的啟發，至今仍有重大的突破。請問細胞的主要成份為何？」，你應該回答簡化過後的問題「請問細胞的主要成份為何？」",
    task_description="請將收到的問題中與問題本身無關的資訊和內容去除，並在不改變原本問題意思的情況下提供盡量簡潔的問句，你的回答應該只含有問句。",
)

# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
keyword_extraction_agent = LLMAgent(
    role_description="你是 LLaMa-3.1-8B，是用來擷取關鍵字的 AI。根據收到的問題擷取出關鍵字，關鍵字應該是原來的問題中就有的詞彙。使用中文時只會使用繁體中文來回問題。回答問題時請將不同關鍵字以一個空格分開。不要根據你的知識回答問題，不要在關鍵字中加入原本的問題中沒有的資訊，只要分析問題並提出關鍵字就好。你要確保原本的問題中的資訊和條件都被包含在關鍵字裡面。例如，如果原本的問題是「三菱集團的創始人是誰」，你應該擷取出關鍵字「三菱集團 創始人」。如果原本的問題是「nba 1999賽季得分王是誰」，你應該擷取出關鍵字「nba 1999 得分王」",
    task_description="請根據收到的問題擷取出方便進行網路搜尋的關鍵字，在不損失問題含義的情況下越精簡越好。",
)

# This agent is the core component that answers the question.
qa_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。你只要針對問題給出答案就好，不用回答多餘的內容。",
    task_description="請根據你的知識和網路搜尋的結果，給出對以下問題最合理的答案。",
)

# This agent only gives answer based on the knowledge of the LLM
truth_agent = LLMAgent(
  role_description="你是 LLaMA-3.1-8B，是用來提供資訊的 AI。你只會根據你所知道的事實回答問題，如果不知道就要說「我不知道」，你的回答只需要含有問題的答案。使用中文時只會使用繁體中文來回問題。",
  task_description="請根據以下問題，根據你知道的事實回答。回答要符合指定的格式。"
)

## RAG pipeline

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there may be more heuristics (e.g. classifying the questions by length, deciding whether a question needs a search at all, or reconfirming the answer before returning it to the user) that are not shown in the flow charts. Use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)
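
As an illustration of the length-based heuristic mentioned above, the sketch below runs a question-shortening step only when the question is long. `simplify_if_long`, the `extract` callable, and the 40-character threshold are all hypothetical. In this notebook, `extract` could be the `inference` method of a question extraction agent.

```python
from typing import Callable

def simplify_if_long(question: str, extract: Callable[[str], str], max_chars: int = 40) -> str:
    """Return short questions unchanged; ask `extract` to shorten long ones."""
    question = question.strip()
    if len(question) <= max_chars:
        return question
    shortened = extract(question).strip()
    # Fall back to the original question if the extractor returns nothing.
    return shortened or question
```

Skipping the extra LLM call for short questions saves inference time and avoids the risk of the extractor dropping a condition from a question that was already concise.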

In [None]:
async def pipeline(question: str) -> str:
    # TODO: Implement your pipeline.
    # This version shortens the question, extracts search keywords, runs a web search, and lets the QA agent answer from the results.
    # Just a quick reminder: make sure your input length is within the model's context window (16384 tokens). You may want to truncate excessively long texts.
    concise_question: str = question_extraction_agent.inference(question)
    print(f"concise question: {concise_question}")
    # The truth agent's answer (from the LLM's own knowledge) is computed here but is not yet used in the final prompt.
    truth_agent_response: str = truth_agent.inference(concise_question)
    keywords: str = keyword_extraction_agent.inference(concise_question)
    print(f"keywords: {keywords}")
    search_result: List[str] = await search(keyword=keywords, n_results=3)
    combined_search_result: str = ""
    for result in search_result:
        # Skip pages longer than 3000 characters to keep the prompt short.
        if len(result) > 3000:
            continue
        combined_search_result = combined_search_result + result + " "
    print(f"serach result: {combined_search_result}")
    qa_prompt: str = f"問題：{question},  搜尋結果：{combined_search_result} "
    print(f"final promp: {qa_prompt}")
    return qa_agent.inference(qa_prompt)
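
The pipeline above drops any page longer than 3000 characters, which can leave the QA agent with no search context at all. A possible alternative, sketched below, truncates each page and caps the combined length instead. `fit_to_budget` and both default budgets are assumptions. Character counts are only a rough proxy for tokens, so the limits would need tuning against the 16384-token context window.

```python
from typing import List

def fit_to_budget(results: List[str], per_result_chars: int = 3000, total_chars: int = 6000) -> str:
    """Join search results, truncating each one instead of dropping it."""
    pieces: List[str] = []
    remaining = total_chars
    for text in results:
        if remaining <= 0:
            break
        # Keep at most per_result_chars, and never exceed the overall budget.
        piece = text[:min(per_result_chars, remaining)]
        if piece:
            pieces.append(piece)
            remaining -= len(piece)
    return "\n".join(pieces)
```

With this helper, the loop in `pipeline` could be replaced by `combined_search_result = fit_to_budget(search_result)`. Keeping the start of each page is a simple choice and may still miss an answer that appears later in the text.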

## Answer the questions using your pipeline!

In [None]:
%rm b12901022*

Since Colab has usage limits, you may get disconnected. The following code saves your answer to each question in its own file. If you have mounted your Google Drive as instructed, you can simply rerun the whole notebook to resume where you left off.

In [None]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "B12901022"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        # Skip questions that already have a saved answer so that reruns resume where they stopped.
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

with open('./private.txt', 'r') as input_f:
    questions = input_f.readlines()
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'a') as output_f:
            print(answer, file=output_f)

concise question: 「虎山雄風飛揚」是哪間學校的校歌？
keywords: 虎山雄風飛揚 校歌
serach result: 校歌南投縣南投市光華國小-精華區lyrics-批踢踢實業坊批踢踢實業坊›精華區betalyrics關於我們聯絡資訊返回上層作者ypx0409(ypx0409)看板lyrics標題校歌南投縣南投市光華國小時間TueSep1603:13:342008貓羅溪旁虎山雄風飛揚這兒是我們生長的地方良師益友濟濟一堂克勤克儉奮發圖強長我育我飲水思源立定志向四海名揚忠心復國是理想光華意志堅光華氣勢昂你我相共勉誓把華夏重光--※發信站:批踢踢實業坊(ptt.cc)◆From:122.116.179.102推vibrant224:國小校歌推...09/1611:10推sognare:校友友情推09/1614:09 
final promp: 問題：校歌為學校（包括小學、中學、大學等）宣告或者規定的代表該校的歌曲。用於體現該校的治學理念、辦學理想等學校文化。「虎山雄風飛揚」是哪間學校的校歌歌詞？,  搜尋結果：校歌南投縣南投市光華國小-精華區lyrics-批踢踢實業坊批踢踢實業坊›精華區betalyrics關於我們聯絡資訊返回上層作者ypx0409(ypx0409)看板lyrics標題校歌南投縣南投市光華國小時間TueSep1603:13:342008貓羅溪旁虎山雄風飛揚這兒是我們生長的地方良師益友濟濟一堂克勤克儉奮發圖強長我育我飲水思源立定志向四海名揚忠心復國是理想光華意志堅光華氣勢昂你我相共勉誓把華夏重光--※發信站:批踢踢實業坊(ptt.cc)◆From:122.116.179.102推vibrant224:國小校歌推...09/1611:10推sognare:校友友情推09/1614:09  
1 光華國小
concise question: 2025年初，NCC規定民眾透過境外郵購自用產品加收審查費多少錢？
keywords: NCC 2025 境外郵購審查費
serach result: 網購國外3C產品需繳納750元審查費NCC為何調整規範？｜典藏新聞｜TAAA｜台北市廣告代理商業同業公會TAIPEIASSOCIATIONOFADVERTISINGAGENCIES台北市廣告代理商業同業公會MENU加入會員／登入登入會員登入加入公會會員關

In [None]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)
            print(answer)

光華國小
根據搜尋結果，2025年初，由於民眾透過境外郵購無線鍑盤、滑鼠藍芽耳機等自用產品回台，每案一律加收審查費750元。
史蒂夫·乔布斯
托福網路測驗（TOEFL iBT）需達到 72 分以上。
觸地 try 可得 7 分。
ruvuwa'an位於現今的台東縣太麻里鄉美和村。
熊仔的碩班指導教授為台大電機系。
迈克尔·法拉第
臺鐵康樂站
20+30=50
根據搜尋結果，達拉斯獨行俠隊的當家球星Luka Doncic被交易至洛杉磯湖人。
根據搜尋結果，2024年美國總統大選的勝出者是賀錦麗（Kamala Harris），她代表民主黨參與競爭，並在沒有其他對手的情況下成為候选人。
根據我的知識和網路搜尋結果，Meta 的 Llama-3.2 系列模型中，最小的參數量是 7Billion。
依據國立臺灣大學學則，每個停修的限制是沒有明確規定，但根 據其他學校（如清華大学）的資料，研究生每年最多可以申請兩門課程。
我無法找到相關的搜尋結果。
波士頓塞爾提克隊
三键化合物
圖靈（Alan Mathison Turing）
南投縣名間鄉
微軟公司。
根據搜尋結果，官將首的起源不明確，但可以知道「官方」、「公務員」的意思。
《咒》的邪神名為北帝。
短暫交會的旅程就此分岔是動力火車歌曲路人甲中的詞句。
根據搜尋結果，2025年卑南族聯合年的活動將由利嘉部落主辦。
根據我的知識和網路搜尋結果，最新的輝達顯卡是出到「GeForce RTX 40系列」。
大S是在泰國旅遊時因病去世。
牛頓是發現萬有引力的科學家。根據他的著作《自然哲学的数学原理》，他在1687年提出万有的物体都互相吸附，彼此之间存在一种普遍而永恒不变的心灵力，即所谓“天地之中无一粒沙，不以六合为界”的萬有引力的概念。
TAIHUCAIS
《终结者》
水的化學式為 H₂O。
很抱歉，我無法提供李宏毅教授在《機器學習》2023年春季班中第15個作業的名稱，因為這些資訊可能不公開或難以找到。
目前臺灣公立的獨립學院僅剩一間，即國防醫學大學（原名為中華民国軍事医学院）。
BitTorrent 協議使用的機制是 DHT（Distributed Hash Table，分散式雜湊表）和 Tracker 伺服器。當一個新的節點加入網路時，它可以通過以下步驟獲得部分資料：  1. 首先，我們需要了解 Bit Torrent 的基本原理