# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, we will mount your own Google Drive and change the working directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Change the working directory to somewhere in your Google Drive.
# You could check the path by right clicking on the folder.
%cd "./drive/MyDrive/BS/大五/ML/"

/content/drive/MyDrive/BS/大五/ML


In this section, we install the necessary Python packages, download the weights of the quantized LLaMA 3.1 8B model, and download the dataset. Note that the model weights take up around 8GB. If you are using your Google Drive as the working directory, make sure it has enough free space for the model.
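Before starting the download, it can help to confirm that the working directory has room for the weights. The following is a minimal sketch using only the standard library; the helper name `has_free_space` and the 9 GB threshold (the roughly 8GB of weights plus some headroom) are illustrative assumptions, not part of the homework template.

```python
import shutil

def has_free_space(path: str = ".", required_gb: float = 9.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    # shutil.disk_usage reports total, used, and free bytes for the filesystem containing `path`.
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3

# Hypothetical check for the current working directory (for example, the mounted Drive folder).
if not has_free_space("."):
    print("Warning: less than 9 GB free; the model download may fail.")
```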

In [None]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl (445.2 MB)
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.meta

In [None]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLMs.

In the following code block, we first load the downloaded LLM weights onto the GPU.
Then, we implement the generate_response() function so that you can get the generated response from the LLM more easily.

You can ignore "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [None]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

def generate_response(_model: Llama, _messages: list[dict]) -> str:
    '''
    Run chat-completion inference on the model with the given messages
    (a list of {"role": ..., "content": ...} dicts) and return the reply text.
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate, you can change it and observe the differences.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
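Because the context window is capped at `n_ctx=16384` tokens and up to `max_tokens=512` of those are spent on the reply, long inputs need to be trimmed before they reach the model. Below is a rough sketch of a character-based budget. The function name `truncate_for_context`, the `reserved_tokens` allowance for the system prompt and chat template, and the `chars_per_token` ratio are all assumptions made for illustration; an exact count would require the model's tokenizer.

```python
def truncate_for_context(
    text: str,
    n_ctx: int = 16384,
    max_new_tokens: int = 512,
    reserved_tokens: int = 512,
    chars_per_token: float = 1.0,
) -> str:
    """Trim `text` so that its estimated token count fits the remaining context budget.

    The estimate assumes `chars_per_token` characters per token, which is only a
    heuristic; it is not measured from the real tokenizer.
    """
    # Tokens left for the input after reserving room for the reply and the prompt scaffolding.
    budget_tokens = n_ctx - max_new_tokens - reserved_tokens
    if budget_tokens <= 0:
        return ""
    budget_chars = int(budget_tokens * chars_per_token)
    return text[:budget_chars]

# Example: a 20,000-character document is cut down to the estimated budget.
long_text = "a" * 20000
print(len(truncate_for_context(long_text)))
```

A conservative ratio such as 1 character per token is a cautious choice for Chinese text, where a single character can map to one or more tokens.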


## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [None]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
import re
urllib3.disable_warnings()

def filter_garbled(text: str) -> str:
    """
    This function removes characters that are not in typical ranges for Chinese,
    Latin letters, numbers, or common punctuation.
    """
    # The regex below allows:
    # - Chinese characters (\u4e00-\u9fff)
    # - Latin letters and digits (A-Za-z0-9)
    # - Whitespace and common punctuation
    pattern = re.compile(r'[^\u4e00-\u9fffA-Za-z0-9\s,\.!?;:"“”‘’()（）\-]')
    return pattern.sub('', text)

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except Exception:  # Skip pages that fail or time out, without swallowing task cancellation.
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

def window_contains_keyword(text: str, keyword: str, window_size: int = 200) -> bool:
    """
    Slide a window over the text and check if any window contains the keyword.
    The check is case-insensitive.
    """
    keyword_lower = keyword.lower()
    text_lower = text.lower()
    # Split the lowercased keyword on whitespace so each term is matched case-insensitively
    # and repeated spaces do not produce empty terms (an empty string matches every window).
    terms = keyword_lower.split()
    if len(text) <= window_size:
        return keyword_lower in text_lower
    # Use overlapping windows (step is half the window size).
    step = max(window_size // 2, 1)
    for i in range(0, len(text) - window_size + 1, step):
        window = text_lower[i:i+window_size]
        # A window matches as soon as it contains any one of the terms.
        if any(term in window for term in terms):
            return True
    return False


async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    This function will search the keyword and return the text content in the first n_results web pages.

    Warning: You may hit HTTP 429 errors if you search too many times within a short period. This is unavoidable, so request more results at once at your own risk.
    Google does not publish the rate limit, so there is not much we can do except change the IP address or wait until Google lifts the ban (we do not know how long the penalty lasts either).
    '''
    keyword = keyword[:100]
    print("keyword = ", keyword)
    # First, search the keyword and get the results. Also, get 2 times more results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh-tw", unique=True))
    # print("intermediate results = ", results)
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # print("result after get_htmls = ", results)
    # Filter out the None values.
    html_results = [x for x in results if x is not None]
    # Parse the HTML.
    soups = [BeautifulSoup(html, 'html.parser') for html in html_results]
    final_results = []
    for soup in soups:
        text = ''.join(soup.get_text().split())
        # Only proceed if the text encoding is UTF-8.
        if detect(text.encode()).get('encoding') != 'utf-8':
            continue
        # Remove unwanted characters.
        cleaned_text = filter_garbled(text)
        # Check if the text looks like random gibberish.
        # print("cleaned_text = ", cleaned_text)
        # Ensure that at least one sliding window of text contains the keyword.
        # if not window_contains_keyword(cleaned_text, keyword, 5000):
        #     continue
        final_results.append(cleaned_text[:5000])

    return final_results[:n_results]
    # results = [BeautifulSoup(x, 'html.parser') for x in results]

    # # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    # results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # # Return the first n results.
    # return results[:n_results]
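Since rate limiting (HTTP 429) is the main failure mode mentioned in the docstring above, one common mitigation is to retry with exponential backoff. The sketch below is not part of the TA's tool: `with_backoff` is a hypothetical helper, and it assumes the failure surfaces as a raised exception from the awaited call.

```python
import asyncio
import random

async def with_backoff(make_call, max_attempts: int = 4, base_delay: float = 1.0):
    """Await make_call(); on failure, sleep base_delay * 2**attempt plus jitter, then retry.

    `make_call` is a zero-argument callable that returns a fresh awaitable on each call.
    The last failure is re-raised once `max_attempts` is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential delay with random jitter so retries do not arrive in lockstep.
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# In the notebook this could wrap the search tool, for example:
#   results = await with_backoff(lambda: search("some keyword", 3))
```

Backoff only spaces out requests; it cannot lift a ban that is already in place, so keeping `n_results` small remains the safer option.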

## Test the LLM inference pipeline

In [None]:
# You can try out different questions here.
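test_question='請問誰是 Taylor Swift？'  # "Who is Taylor Swift?"

messages = [
    # System prompt: "You are LLaMA-3.1-8B, an AI for answering questions. When using Chinese, answer only in Traditional Chinese."
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]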

print(generate_response(llama3, messages))

泰勒絲（Taylor Swift）是一位美國歌手、詞曲作家和音樂製作人。她出生於1989年，來自田納西州。她的音乐风格从乡村摇滚发展到流行搖擺，並且她被誉为当代最成功的女艺人的之一。

泰勒絲早期在鄉郊小鎮演唱會時開始發展音樂事業，她推出了多張專輯，包括《Taylor Swift》、《Fearless》，以及後來更為知名的大熱作如 《1989》（2014年）、_reputation（）和 _Lover （）。她的歌曲經常探討愛情、友誼及自我成長等主題。

泰勒絲獲得了許多獎項，包括13座格萊美奖，並且是史上最快達到百萬銷量的女藝人之一。


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.
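To make the formatting step concrete, here is a minimal standalone sketch of how a role description, a task description, and a message could be combined into chat messages. The function `format_messages`, its `global_instruction` parameter, and the English section labels are illustrative assumptions; the class in the next cell uses its own prompt wording.

```python
from typing import Dict, List, Optional

def format_messages(
    role_description: str,
    task_description: str,
    message: str,
    search_results: Optional[List[str]] = None,
    global_instruction: str = "",
) -> List[Dict[str, str]]:
    """Build a system/user message pair in the chat format expected by chat-completion APIs."""
    # The role (plus any global instruction) goes into the system prompt.
    system = f"{role_description} {global_instruction}".strip()
    # The user prompt is assembled from labelled sections.
    parts = [f"Task:\n{task_description}"]
    if search_results:
        numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(search_results, 1))
        parts.append(f"Search results:\n{numbered}")
    parts.append(f"Question:\n{message}")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": "\n\n".join(parts)},
    ]

msgs = format_messages(
    "You are a history expert.",
    "Please answer the following question in yes/no.",
    "Was the printing press invented before 1500?",
    global_instruction="Please speak in a polite manner.",
)
print(msgs[0]["content"])
```

Numbering the search results, rather than interpolating a raw Python list, keeps the prompt free of brackets and quote characters that carry no meaning for the model.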

In [None]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role means who this agent should act like. e.g. the history expert, the manager......
        self.task_description = task_description    # Task description instructs what task should this agent solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message: str, search_results: List[str] | None = None) -> str:
        # Use None rather than a mutable [] default; None and an empty list both mean "no search results".
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messages first. The system prompt appends "Must answer in Traditional Chinese."
            if search_results:
                # User prompt sections: task description, search results, user question.
                messages = [
                    {"role": "system", "content": f"{self.role_description} 必須使用繁體中文回答問題。"},
                    {
                        "role": "user",
                        "content": f"任務描述：\n{self.task_description}\n\n搜尋結果：\n{search_results}\n\n使用者問題：\n{message}",
                    },
                ]
            else:
                # User prompt sections: task description, user message.
                messages = [
                    {"role": "system", "content": f"{self.role_description} 必須使用繁體中文回答問題。"},
                    {"role": "user", "content": f"任務描述：\n{self.task_description}\n\n使用者訊息：\n{message}"},
                ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""

TODO: Design the role description and task description for each agent.

In [None]:
# TODO: Design the role and task description for each agent.

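# This agent may help you filter out the irrelevant parts in question descriptions.
# Role (translated): "You are a professional question-summarization expert who condenses long questions into concise core questions."
# Task (translated): "Read the long question carefully and summarize it into a shorter, more refined version that keeps the point of the question. Do not answer it. Output only the summarized question, in Traditional Chinese."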
question_extraction_agent = LLMAgent(
    role_description="您是一位專業的問題摘要專家，專精於將長問題摘要成簡潔的核心問題。",
    task_description="請仔細閱讀以下長問題，並將其摘要成一個更短、更精煉的版本。請確保摘要能表達提問重點。請勿回答問題本身。請僅輸出摘要後的短問題本身。請使用繁體中文。",
)

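# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
# Role (translated): "You are a professional keyword-extraction expert who extracts the keywords that best represent the core of a long question."
# Task (translated): "Read the long question carefully and extract the keywords that best represent its core. Output only the keywords, separated by spaces, in Traditional Chinese."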
keyword_extraction_agent = LLMAgent(
    role_description="您是一位專業的關鍵詞提取專家，專精於從長問題中提取出最能代表其核心提問的關鍵詞。",
    task_description="請仔細閱讀以下長問題，並提取出最能代表其核心提問的關鍵詞。請確保這些關鍵詞能概括提問重點。請僅輸出以空格分隔的關鍵詞本身。請使用繁體中文。",
)

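# This agent is the core component that answers the question.
# Role (translated): "You are a professional question-answering system that gives accurate answers based on the provided search results and the user's question."
# Task (translated): "Answer the user's question based on the provided search results. Output only the answer itself, in Traditional Chinese."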
qa_agent = LLMAgent(
    role_description="您是一位專業的問答系統，專精於根據提供的搜尋結果和使用者問題，提供準確的答案。",
    task_description="請根據以下提供的搜尋結果和使用者問題，回答使用者的問題。請確保答案基於搜尋結果。請僅輸出問題的答案本身。請使用繁體中文。",
)

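# This agent answers directly, without search results (used for short questions).
# Role (translated): "You are a professional question-answering system that gives accurate answers to the user's question."
# Task (translated): "Answer the user's question correctly and output only the answer itself, in Traditional Chinese."
qa_agent_simple = LLMAgent(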
    role_description="您是一位專業的問答系統，專精於根據使用者的問題，提供準確的答案。",
    task_description="請根據以下提供的使用者問題，正確回答問題，並僅輸出問題的答案本身。請使用繁體中文。",
)

## RAG pipeline

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there may be more heuristics (e.g., classifying the questions by length, determining whether a question needs a search, or reconfirming the answer before returning it to the user) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)
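As a rough illustration of how these stages fit together, the sketch below wires keyword extraction, search, and answering into one function, with a length-based heuristic that skips the search for short questions. Everything here is a simplified assumption: the stage functions are passed in as plain synchronous callables (the real search tool is asynchronous), and the names `rag_answer`, `min_search_len`, and `max_chars_per_doc` are hypothetical.

```python
from typing import Callable, List

def rag_answer(
    question: str,
    extract_keywords: Callable[[str], str],
    search_fn: Callable[[str], List[str]],
    answer_fn: Callable[[str, List[str]], str],
    min_search_len: int = 10,
    max_chars_per_doc: int = 5000,
) -> str:
    """Answer directly for short questions; otherwise retrieve documents first."""
    # Heuristic: very short questions skip the search step entirely.
    if len(question) <= min_search_len:
        return answer_fn(question, [])
    keywords = extract_keywords(question)
    # Truncate each retrieved document so the combined prompt stays within the context window.
    docs = [doc[:max_chars_per_doc] for doc in search_fn(keywords)]
    return answer_fn(question, docs)

# Stub stages standing in for the agents and the search tool.
def demo_extract(q: str) -> str:
    return " ".join(q.split()[:3])

def demo_search(keywords: str) -> List[str]:
    return [f"A page about {keywords}."]

def demo_answer(q: str, docs: List[str]) -> str:
    return f"answered with {len(docs)} document(s)"

print(rag_answer("Which planet was reclassified as a dwarf planet?", demo_extract, demo_search, demo_answer))
```

Passing the stages in as callables makes it easy to swap a stub for a real agent while testing one stage at a time.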

In [None]:
async def pipeline(question: str) -> str:
    # TODO: Implement your pipeline.
    # Questions longer than 10 characters go through keyword extraction, summarization, and web search;
    # shorter ones are answered directly by the LLM without search.
    # You may want to get the final results through multiple inferences.
    # Just a quick reminder, make sure your input length is within the limit of the model context window (16384 tokens), you may want to truncate some excessive texts.
    # print("original question = ", question)
    if len(question) > 10:
        keywords = keyword_extraction_agent.inference(question, [])
        abstract = question_extraction_agent.inference(question, [])
        search_results = await search(keywords, 3)
        # print("keywords = ", keywords)
        # print("abstract = ", abstract)
        # print("search_results = ", [r[:10] for r in search_results])
        ans = qa_agent.inference(abstract, search_results)
    else:
        ans = qa_agent_simple.inference(question, [])

    return ans

## Answer the questions using your pipeline!

Since Colab has usage limits, you might get disconnected. The following code saves your answer to each question in its own file. If you have mounted your Google Drive as instructed, you can simply rerun the whole notebook to resume from where it stopped.
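The resume behaviour relies on one answer file per question, so a rerun only needs to process the ids whose files are missing. The following sketch shows that check in isolation; `pending_ids` is a hypothetical helper, and the demo writes to a temporary directory rather than to your Drive.

```python
import tempfile
from pathlib import Path
from typing import List

def pending_ids(student_id: str, first_id: int, last_id: int, directory: str = ".") -> List[int]:
    """Return the question ids in [first_id, last_id] that have no saved answer file yet."""
    base = Path(directory)
    return [
        i for i in range(first_id, last_id + 1)
        if not (base / f"{student_id}_{i}.txt").exists()
    ]

# Demo in a throwaway directory: pretend questions 1 and 2 were answered before a disconnect.
with tempfile.TemporaryDirectory() as tmp:
    for done in (1, 2):
        (Path(tmp) / f"demo_{done}.txt").write_text("answer\n", encoding="utf-8")
    print(pending_ids("demo", 1, 5, directory=tmp))  # prints [3, 4, 5]
```

Running such a check before combining the results is a quick way to confirm that every question has an answer file.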

In [None]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "b09901142"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)
        # break

with open('./private.txt', 'r') as input_f:
    # Strip trailing newlines so the pipeline receives clean question text.
    questions = [l.strip() for l in input_f.readlines()]
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

original question =  在《遊戲王》卡牌遊戲中，以「真紅眼黑龍」與「黑魔導」作為融合素材的融合怪獸是什麼？

keyword =  真紅眼 黑魔導 鐵甲龍
keywords =  真紅眼 黑魔導 鐵甲龍
abstract =  《遊戲王》中以「真紅眼黑龍」與 「 黑魔導 」作為融合素材的哪隻怪獸？
47 真紅眼黑龍
original question =  豐田萌繪在《BanG Dream!》企劃中，擔任哪個角色的聲優？

keyword =  豐田萌繪  BanG Dream! 角色 聲優
keywords =  豐田萌繪  BanG Dream! 角色 聲優
abstract =  豐田萌繪在《BanG Dream!》中聲演哪個角色？
48 松原花音
original question =  Rugby Union 中，9 號球員的正式名稱為何？

keyword =  Scrum半back
keywords =  Scrum半back
abstract =  九號球員的正式名稱是掃劍手。
49 Scrum-half。
original question =  曾被視為太陽系中的行星，最終被降格成矮行星的星球為何？

keyword =  矮行星
keywords =  矮行星
abstract =  什麼是曾被視為太陽系中的行星，最終卻遭到降格成矮天體的那顆恥辱之球？
50 冥王星
original question =  以往政府對動物保護的觀念僅停留在寵物，因此動保法又被調侃為可愛動物保護法，近年來政策逐漸重視野生動物的保護。臺灣最早成立的野生動物救傷單位位於哪個行政區內？

keyword =  野生動物  臺灣政府政策重視保護
keywords =  野生動物  臺灣政府政策重視保護
abstract =  臺灣最早成立的野生動物救傷單位位於哪個行政區內？
51 台北市立動物園
original question =  位於南投縣集集鎮的特生中心是親子育樂的好去處，館內以臺灣本土生態及生物為主軸，規劃高、中、低海拔生態系、特有動物、特有植物、環境-生物-人、自然保育、植物的奧秘及動物的奇觀等主題展區。特生中心在2023年改名，目前該單位的名字為？

keyword =  特生中心  南投縣集集中    親子育樂   本土生物
keyw

In [None]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)