# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, we will mount your own Google Drive and change the working directory.

In [None]:
# Run on kaggle

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# Change the working directory to somewhere in your Google Drive.
# You could check the path by right clicking on the folder.
# %cd [change to the directory you prefer]

[Errno 2] No such file or directory: '[change to the directory you prefer]'
/kaggle/working


In this section, we install the necessary Python packages, download the weights of the quantized version of LLaMA 3.1 8B, and download the dataset. Note that the model weights are around 8GB. If you are using your Google Drive as the working directory, make sure you have enough space for the model.

In [None]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp310-cp310-linux_x86_64.whl (445.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m445.2/445.2 MB[0m [31m301.9 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 130.6 MB/s eta 0:00:00
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.p

In [None]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLMs.

In the following code block, we first load the downloaded LLM weights onto the GPU.
Then, we implement the generate_response() function so that you can get the generated response from the LLM more easily.

You can ignore the "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [None]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

def generate_response(_model: Llama, _messages: list[dict]) -> str:
    '''
    Run inference on the model with the given chat messages.
    Each message is a dict with "role" and "content" keys.
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate, you can change it and observe the differences.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


## Search Tool

The TA has implemented a search tool that looks up keywords using Google Search. You can use this tool to find the relevant **web pages** for a given question, and you can integrate it into your pipeline in the following sections.

In [None]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except Exception:  # Treat any fetch failure (timeout, connection error, ...) as "no page".
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    This function searches for the keyword and returns the text content of the first n_results web pages.

    Warning: You may get HTTP 429 errors if you search too many times in a short period. This is unavoidable, and you search for more results at once at your own risk.
    Google does not publish the rate limit, so there is not much we can do except change the IP or wait until Google lifts the ban (we don't know how long the penalty lasts either).
    '''
    keyword = keyword[:100]
    # First, search for the keyword and get the results. Request twice as many results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove all whitespace. Also, keep only the pages detected as UTF-8.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]
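
The docstring above warns about HTTP 429 responses and suggests waiting. One common way to wait is to retry with exponential backoff. The sketch below is only an illustration and is not part of the TA's tool: `retry_async`, `max_attempts`, and `base_delay` are hypothetical names, and it retries on any exception because this notebook does not state which exception the search library raises on a 429. Backing off does not guarantee that Google will lift a ban.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def retry_async(
    func: Callable[[], Awaitable[T]],
    max_attempts: int = 3,    # Hypothetical default.
    base_delay: float = 1.0,  # Hypothetical default, in seconds.
) -> T:
    """Await func(); after a failure, sleep base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return await func()
        except Exception:
            # Re-raise once the last allowed attempt has failed.
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

In this notebook the helper could wrap the search call, for example `await retry_async(lambda: search(keywords))`.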

## Test the LLM inference pipeline

In [None]:
# # You can try out different questions here.
# test_question='請問誰是 Taylor Swift？'

# messages = [
#     {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回答問題。"},    # System prompt
#     {"role": "user", "content": test_question}, # User prompt
# ]

# print(generate_response(llama3, messages))

## Agents

The TA has implemented the LLMAgent class for you. You can use this class to create agents that interact with the LLM. The class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.
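
To make the formatting step of `inference` concrete, here is a minimal sketch of how a role description, a task description, and a user message could be combined into chat messages. `build_messages` and the `separator` default are hypothetical and chosen only for illustration; the class defined in the next cell does this inline with a plain line break.

```python
from typing import Dict, List

def build_messages(
    role_description: str,
    task_description: str,
    message: str,
    separator: str = "\n---\n",  # Hypothetical delimiter between task and message.
) -> List[Dict[str, str]]:
    """Return a system prompt and a user prompt in chat-completion format."""
    return [
        {"role": "system", "content": role_description},
        {"role": "user", "content": f"{task_description}{separator}{message}"},
    ]

messages = build_messages(
    "You are a history expert.",
    "Please answer the following question in yes/no.",
    "Did the Roman Empire exist?",
)
print(messages[1]["content"])
```

An explicit separator is one way to help the model tell the task description apart from the user message.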

In [None]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # The role is who this agent should act as, e.g. a history expert or a manager.
        self.task_description = task_description    # The task description states what task this agent should solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messages first.
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # System prompt  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"{self.task_description}\n{message}"}, # User prompt  # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper separator text rather than a simple line break is recommended.
            ]
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""

TODO 1: Design the role description and task description for each agent.

In [None]:
# TODO 1: Design the role and task description for each agent.

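# This agent may help you filter out the irrelevant parts in question descriptions.
# Role (in English): "You are a question-analysis expert, skilled at turning a user's complicated description into a clear question. You answer only in Traditional Chinese."
# Task (in English): "Extract a clear question statement from the following content and describe the question in one sentence. You must begin with「問題:」("Question:"):"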
question_extraction_agent = LLMAgent(
    role_description="你是問題分析專家，擅長把使用者輸入的複雜敘述轉換成清晰的問題。你只會使用繁體中文來回答問題。",
    task_description="請從以下內容抽取出明確的問題敘述，用一句話描述問題，你只能用「問題:」開頭:",
)

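# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
# Role (in English): "You are a keyword-extraction expert, skilled at extracting web-search keywords from questions. You answer only in Traditional Chinese."
# Task (in English): "Extract all keywords suitable for a web search from the following question. The keywords must come from the question; do not add words that are not in it. Give me only the keywords, separated by spaces:"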
keyword_extraction_agent = LLMAgent(
    role_description="你是一個關鍵字擷取專家，擅長從問題中擷取用於網路搜尋的關鍵字。你只會使用繁體中文來回答問題。",
    task_description="請從以下問題中提取出適合用在網路搜尋的所有關鍵字，關鍵字一定來自問題，不要自己亂加不在問題中的詞彙，只給我關鍵字就好，每個關鍵字用空格分開:",
)

# This agent is the core component that answers the question.

# qa_agent = LLMAgent(
#     role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回答問題。",
#     task_description="請回答以下問題：",
# )

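# Role (in English): "You are LLaMA-3.1-8B, an AI for answering questions. When using Chinese, you answer only in Traditional Chinese."
# Task (in English): "If reference material is provided, answer the question based on it. If none is provided, answer using your own knowledge."
qa_agent = LLMAgent(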
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回答問題。",
    task_description="如果有給你參考資料，就根據參考資料回答問題。如果沒有給參考資料，就用你知道的知識回答問題。",
)

## RAG pipeline

TODO 2: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there might be more heuristics (e.g. classifying the questions based on their lengths, determining whether a question needs a search, reconfirming the answer before returning it to the user) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)
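
The flow charts are not reproduced in text, so the following is only a rough sketch of one way the "RAG with agents" idea can be wired together: extract keywords, retrieve references, extract a clean question, then answer. All names here (`rag_with_agents`, `search_fn`, `fake_search`, and the stand-in lambdas) are hypothetical, and the components are stubs so the sketch runs without a model or network access.

```python
import asyncio
from typing import Awaitable, Callable, List

async def rag_with_agents(
    question: str,
    extract_question: Callable[[str], str],
    extract_keywords: Callable[[str], str],
    search_fn: Callable[[str], Awaitable[List[str]]],
    answer: Callable[[str], str],
) -> str:
    """Extract keywords, retrieve references, then answer the cleaned question."""
    keywords = extract_keywords(question)
    references = await search_fn(keywords)
    clean_question = extract_question(question)
    context = "\n".join(references)
    if context:
        prompt = f"References:\n{context}\n\nQuestion: {clean_question}"
    else:
        prompt = clean_question  # Fall back to answering without references.
    return answer(prompt)

# Stand-in search so the sketch runs offline.
async def fake_search(keywords: str) -> List[str]:
    return [f"document about {keywords}"]

result = asyncio.run(
    rag_with_agents(
        "Long story... What is RAG?",
        extract_question=lambda q: q.split("...")[-1].strip(),
        extract_keywords=lambda q: "RAG",
        search_fn=fake_search,
        answer=lambda prompt: prompt,  # Echo the prompt instead of calling an LLM.
    )
)
print(result)
```

Inside a notebook cell, where an event loop is already running, use `await rag_with_agents(...)` instead of `asyncio.run(...)`.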

In [None]:
async def pipeline(question: str) -> str:
    # TODO 2: Implement your pipeline.
    # Currently, it only feeds the question directly to the LLM.
    # You may want to get the final results through multiple inferences.
    # Reminder: make sure your input length stays within the model's context window (16384 tokens). You may want to truncate excessive text.

    # Heuristic: short questions (under 20 characters) go straight to the QA agent without a search.
    if len(question) < 20:
        return qa_agent.inference(question)

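    # Upper line: extract keywords and search the web with them.
    keywords = keyword_extraction_agent.inference(question)
    search_results = await search(keyword=keywords, n_results=3)
    # Truncate the joined references by characters to stay within the context window.
    ref = "\n".join(search_results)[:13000]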


    # Lower line: extract a clean question from the original description.
    extracted_question = question_extraction_agent.inference(question)


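    # Ask the QA agent. The extracted question is expected to start with「問題:」("Question:"),
    # so the message reads「請回答以下問題:...」("Please answer the following question: ...").
    # 「參考資料：」means "Reference material:".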
    if ref == "":
        message = "請回答以下" + extracted_question
    else:
        message = "參考資料：" + ref + "\n請回答以下" + extracted_question
    answer = qa_agent.inference(message)


    print(f"keywords: {keywords}\n")
    for i, result in enumerate(search_results, 1):
        print(f"search result {i}: {result}\n")
    print(f"extracted question: {extracted_question}\n")
    print(f"answer: {answer}\n\n\n\n")


    return answer
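
The pipeline above truncates the joined references to 13000 characters, which can drop later search results entirely when the first page is long. As one possible alternative, the sketch below gives each result an equal share of a character budget. `truncate_references` and `budget_chars` are hypothetical names, and a character count is only a rough proxy for the token limit.

```python
from typing import List

def truncate_references(results: List[str], budget_chars: int = 13000) -> str:
    """Join results with newlines, giving each an equal share of the budget."""
    if not results:
        return ""
    n = len(results)
    # Reserve one character per newline separator, then split the rest evenly.
    share = max((budget_chars - (n - 1)) // n, 0)
    return "\n".join(text[:share] for text in results)

truncated = truncate_references(["a" * 10, "b" * 10, "c" * 10], budget_chars=14)
print(truncated)
```

With a budget of 14 characters and three results, each result keeps 4 characters and the two separators use the remaining 2.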

## Answer the questions using your pipeline!

Since Colab has a usage limit, you might get disconnected. The following code saves your answer for each question. If you have mounted your Google Drive as instructed, you can just rerun the whole notebook to continue where you left off.

In [None]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "r13922186"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]  # separate the question and ground truth answer
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        print(id, "\n")
        answer = await pipeline(question)  # generate answer
        answer = answer.replace('\n',' ')
        # print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

with open('./private.txt', 'r') as input_f:
    questions = input_f.readlines()
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        print(id, "\n")
        answer = await pipeline(question)  # generate answer
        answer = answer.replace('\n',' ')
        # print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

1 

keywords: 校歌 代表 歌曲 學院 小學 中學校 大學院 校治理念 辦教 理想 等

search result 1: 校歌-维基百科，自由的百科全书跳转到内容主菜单主菜单移至侧栏隐藏导航首页分类索引特色内容新闻动态最近更改随机条目特殊页面帮助帮助维基社群方针与指引互助客栈知识问答字词转换IRC即时聊天联络我们关于维基百科搜索搜索外观资助维基百科创建账号登录个人工具资助维基百科创建账号登录未登录编辑者的页面了解详情贡献讨论目录移至侧栏隐藏序言1分类开关分类子章节1.1按来源1.2按学校类型2中国大陆的校歌发展历史开关中国大陆的校歌发展历史子章节2.1近代早期校歌2.2五四后的校歌2.3抗战时期校歌3校歌之最4参考文献5外部链接6参见开关目录校歌4种语言English日本語한국어粵語编辑链接条目讨论简体不转换简体繁體大陆简体香港繁體澳門繁體大马简体新加坡简体臺灣正體阅读编辑查看历史工具工具移至侧栏隐藏操作阅读编辑查看历史常规链入页面相关更改上传文件固定链接页面信息引用此页获取短链接下载二维码打印/导出下载为PDF打印版本在其他项目中维基共享资源维基数据项目外观移至侧栏隐藏维基百科，自由的百科全书此条目论述以中国大陆为主，未必有普世通用的观点。请协助补充内容以避免偏颇，或讨论本文的问题。校歌为学校（包括小学、中学、大学等）宣告或者规定的代表该校的歌曲。用于体现该校的治学理念、办学理想等学校文化。一所学校可能不止一首校歌，同一首歌也可能被不止一所学校定为校歌，而且也有未指定校歌的学校。根据学校的不同还可能称为园歌（幼儿园）、院歌（学院）等，英文对应也有SchoolSong、CollegeSong、UniversitySong等称法。分类[编辑]按来源[编辑]专门创作：专门作为校歌而创作的歌曲，可能词曲皆为原创，可能用已有曲调填词，也可能用已有诗词等作为歌词谱曲成歌。如：中国人民抗日军事政治大学的校歌《抗日军政大学校歌》词曲即皆为专门为该校创作。继承：学校合并、分裂、改制等时，可能会继承原校的校歌。如：中国人民解放军国防大学的校歌《抗日军政大学校歌》就是继承自该校前身中国人民抗日军事政治大学。北京清华大学与新竹国立清华大学都使用1923年所创作的《清华大学校歌》。借用：直接使用已有歌曲作为校歌的院校也有。比如，一些专业性院校借用行业歌曲作为校歌。如：1990

In [None]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)