##### Problem: "The paper clearly has a conclusion section, but the conclusion cannot be found"

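##### Possible causes:
- Retrieval failure: the vector search never returned any chunks related to the conclusion
- Retrieval succeeded but the LLM misjudged: the conclusion text was retrieved, yet the LLM answered "there is no conclusion"
- Chunk boundary problem: the conclusion was split into fragments that lost their meaning
- Embedding problem: the embeddings of "結論" (Chinese for "conclusion") and "conclusion" are too far apart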
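
To tell the first two causes apart, one option is to check where the section keyword actually appears. The sketch below is not part of the original notebook: `diagnose_retrieval` and its return labels are hypothetical names, and it works on plain strings (for example, the `page_content` of each chunk). A keyword match is only a rough proxy for "the conclusion was retrieved".

```python
def diagnose_retrieval(all_chunks, retrieved_chunks, keyword="conclusion"):
    """Classify a failed query by where the keyword shows up.

    Both arguments are lists of plain strings (hypothetical helper).
    """
    kw = keyword.lower()
    in_corpus = any(kw in c.lower() for c in all_chunks)
    in_retrieved = any(kw in c.lower() for c in retrieved_chunks)

    if not in_corpus:
        return "missing_from_chunks"   # loader or splitter lost the text
    if not in_retrieved:
        return "retrieval_failure"     # text exists, but search did not return it
    return "retrieved_check_llm"       # text was retrieved; look at the prompt/LLM step


# Toy strings standing in for real chunks.
chunks = ["Abstract Large Language Models ...", "6 Conclusion In this paper ..."]
print(diagnose_retrieval(chunks, retrieved_chunks=[chunks[0]]))
```

In the notebook this could be called with `[d.page_content for d in docs]` and the `page_content` of the documents returned by `similarity_search`.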

# 1. Check the loader → OK

In [1]:
# Imports for building the overall UI and turning it into a Q&A bot
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI



In [2]:
import os 
from openai import OpenAI 
from dotenv import load_dotenv, find_dotenv 
_ = load_dotenv(find_dotenv()) 
client = OpenAI(
    api_key=os.environ['OPENAI_API_KEY']
)
print("done")

done


In [3]:
from dotenv import load_dotenv
import os

if 'OPENAI_API_KEY' in os.environ:
    del os.environ['OPENAI_API_KEY']

load_dotenv()

api_key = os.environ.get('OPENAI_API_KEY')
if api_key:
    print("✅ Success! API key loaded")
    print(f"Key starts with: {api_key[:15]}...")
    print(f"Key length: {len(api_key)} characters")
else:
    print("❌ Still not working")

✅ Success! API key loaded
Key starts with: sk-proj-FWUUter...
Key length: 164 characters


In [6]:
# Load a single paper
folder_path = "/Users/mangtinglee/Desktop/2025_gap_careerpath/RAG_LLM/pdfs"
pdf_files = [f for f in os.listdir(folder_path) if f.endswith('.pdf')]
print(f"Found {len(pdf_files)} PDF files:")

test_file = os.path.join(folder_path, pdf_files[0])
loader = PyPDFLoader(test_file)
documents = loader.load()

print(f"Test file loaded: {len(documents)} pages")
print(f"Content: {documents[0].page_content[:500]}...")
print(f"metadata：{documents[0].metadata}")

Found 23 PDF files:
Test file loaded: 11 pages
Content: Large Language Models Sensitivity to The Order of Options in
Multiple-Choice Questions
Pouya Pezeshkpour
Megagon Labs
pouya@megagon.ai
Estevam Hruschka
Megagon Labs
estevam@megagon.ai
Abstract
Large Language Models (LLMs) have demon-
strated remarkable capabilities in various NLP
tasks. However, previous works have shown
these models are sensitive towards prompt
wording, and few-shot demonstrations and
their order, posing challenges to fair assess-
ment of these models. As these models be-
come ...
metadata：{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-08-23T01:15:54+00:00', 'author': '', 'keywords': '', 'moddate': '2023-08-23T01:15:54+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '/Users/mangtinglee/Desktop/2025_gap_careerpath/RAG_LLM/pdfs/2023_LLM limitation_Large La

# 2. Check the split → chunks were produced, but it is unclear whether they are correct

In [9]:
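text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    # The original list also contained the regex "(?<=\. )". Without
    # is_separator_regex=True it is matched as a literal string, so it never
    # fired; ". " already covers sentence boundaries.
    separators=["\n\n", "\n", ". ", " ", ""],
)
docs = text_splitter.split_documents(documents)
print(len(docs))       # number of chunks
print(len(documents))  # number of pages (11 in the test PDF)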

50
11


In [10]:
# Check whether the chunks look reasonable
import statistics

def analyze_chunks(docs):
    lengths = [len(d.page_content) for d in docs]

    print(f"Total chunks: {len(docs)}")
    print(f"Average length: {statistics.mean(lengths):.0f}")
    print(f"Min length: {min(lengths)}")
    print(f"Max length: {max(lengths)}")
    # Population standard deviation, same as the hand-written formula
    print(f"Length std dev: {statistics.pstdev(lengths):.0f}")

    # Flag very short chunks (possibly a splitting error)
    short_chunks = [i for i, l in enumerate(lengths) if l < 200]
    if short_chunks:
        print(f"Warning: {len(short_chunks)} chunks are very short")

analyze_chunks(docs)

Total chunks: 50
Average length: 901
Min length: 193
Max length: 996
Length std dev: 179
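
Length statistics alone do not show whether a section heading was separated from its body. As a rough check for the chunk-boundary cause, the sketch below looks for a heading string in each chunk and flags hits that sit near the end of the chunk. `find_heading_positions` and the `tail` threshold are hypothetical, and the heading text `"Conclusion"` is an assumption about how the PDF spells it.

```python
def find_heading_positions(chunks, heading="Conclusion", tail=100):
    """Return (chunk_index, offset, near_end) for each chunk containing the heading.

    near_end is True when the heading starts within the last `tail` characters,
    which suggests most of the section body landed in the next chunk.
    """
    hits = []
    for i, text in enumerate(chunks):
        pos = text.find(heading)
        if pos != -1:
            hits.append((i, pos, len(text) - pos <= tail))
    return hits


# Toy chunks: the heading appears right at the end of the first one.
sample = ["x" * 950 + "\n6 Conclusion\nWe", "studied option order ..."]
print(find_heading_positions(sample))
```

In the notebook this could be run on `[d.page_content for d in docs]`; an empty result would mean the heading string never made it into any chunk as written.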


# 3. Embedding

In [11]:
# Create the database directory; skip this if it already exists
import os

# Create the folder
os.makedirs('./chroma_db', exist_ok=True)
print("Done!")

# Verify that it worked
print(f"Folder exists? {os.path.exists('./chroma_db')}")

Done!
Folder exists? True


In [12]:
# define embedding
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [13]:
# Note: the database directory must first be created in your own environment
persist_directory = './chroma_db'  # database directory
!rm -rf ./chroma_db  # remove old database files if any

In [14]:
# Create a new vector database and add the chunks to it
from langchain.vectorstores import Chroma
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory=persist_directory
)

In [15]:
# Check that the number of stored vectors equals the number of chunks
print(vectordb._collection.count())

50


In [20]:
question = "can you explain the abstract of this article?"
ans_docs = vectordb.similarity_search(question,k=3)
print(len(ans_docs))
print(ans_docs[2].page_content)

3
sensitivity gap coverage ranging from 20% to 72%,
while the mitigating bias pattern ranges from 0.9%
to 38%. These results validate the effectiveness
of the identified pattern for both amplifying and
mitigating bias. Additionally, in most cases, the
amplifying pattern covers a considerably greater
portion of the sensitivity gap comparing to the mit-
igating pattern. It is important to highlight that the
patterns we have identified for amplifying bias can
serve as valuable insights for enhancing model per-
formance or launching adversarial attacks against
them. Furthermore, the patterns we have estab-
lished for mitigating bias can play a crucial role in
shaping benchmark design and guiding annotating
efforts to create less biased benchmarks.
5 Calibrating LLMs for MCQ Tasks
We conduct an in-depth investigation into how large
language models react to changes in the order of
options, and investigate the reasons behind their
sensitivity to such changes. Through our explo-
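
The third hit above is a passage around the Section 5 heading rather than the abstract, so it helps to know how far down the expected chunk actually ranks. The sketch below computes that rank from raw vectors with cosine similarity. `rank_of_target` is a hypothetical helper, the 2-D vectors are toy stand-ins, and it assumes the real vectors would come from something like `embeddings.embed_query(question)` and `embeddings.embed_documents(...)`. It does not guard against zero-length vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_of_target(query_vec, chunk_vecs, target_index):
    """1-based rank of chunk `target_index` when sorted by similarity to the query."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(target_index) + 1


# Toy 2-D vectors standing in for real embeddings.
query = [1.0, 0.0]
chunk_vectors = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
print(rank_of_target(query, chunk_vectors, target_index=1))
```

A rank just outside `k` points to a retrieval setting that is too tight, while a rank far down the list points more toward an embedding mismatch between the question and the chunk.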


In [21]:
# Check whether any page in documents contains "abstract"
for i, doc in enumerate(documents):
    if 'abstract' in doc.page_content.lower():
        print(f"Found abstract in document {i}:")
        print(doc.page_content[:500])
        print("---")

Found abstract in document 0:
Large Language Models Sensitivity to The Order of Options in
Multiple-Choice Questions
Pouya Pezeshkpour
Megagon Labs
pouya@megagon.ai
Estevam Hruschka
Megagon Labs
estevam@megagon.ai
Abstract
Large Language Models (LLMs) have demon-
strated remarkable capabilities in various NLP
tasks. However, previous works have shown
these models are sensitive towards prompt
wording, and few-shot demonstrations and
their order, posing challenges to fair assess-
ment of these models. As these models be-
come 
---
Found abstract in document 1:
In this paper, we investigating the sensitivity of
LLMs to the order of options in multiple-choice
questions; using it as a proxy to understand LLMs
sensitivity to the order of prompt elements in
instruction- or demonstration-based paradigm. We
demonstrate an example of GPT-4’s sensitivity to
options order in Figure 1, using a sample from the
CSQA benchmark (Talmor et al., 2018). Notably,
by merely rearranging the placement of opti

In [22]:
# See what was actually loaded for the first page
first_page = documents[0]
print("First page content:")
print(first_page.page_content[:1000])

First page content:
Large Language Models Sensitivity to The Order of Options in
Multiple-Choice Questions
Pouya Pezeshkpour
Megagon Labs
pouya@megagon.ai
Estevam Hruschka
Megagon Labs
estevam@megagon.ai
Abstract
Large Language Models (LLMs) have demon-
strated remarkable capabilities in various NLP
tasks. However, previous works have shown
these models are sensitive towards prompt
wording, and few-shot demonstrations and
their order, posing challenges to fair assess-
ment of these models. As these models be-
come more powerful, it becomes imperative
to understand and address these limitations.
In this paper, we focus on LLMs robust-
ness on the task of multiple-choice questions—
commonly adopted task to study reasoning and
fact-retrieving capability of LLMs. Investigat-
ing the sensitivity of LLMs towards the order
of options in multiple-choice questions, we
demonstrate a considerable performance gap
of approximately 13% to 75% in LLMs on dif-
ferent benchmarks, when answer options ar