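# RAG (Retrieval-Augmented Generation)
- [RAG](https://python.langchain.com/v0.1/docs/modules/data_connection/) stands for *Retrieval-Augmented Generation*, a **retrieval-based generation technique**. It helps an LLM produce more accurate and reliable answers by grounding them in specific documents.
- For a user's question, documents related to the question are retrieved from a self-built database (DB) or an external database and passed to the LLM together with the question.
- The LLM generates its answer based on the documents it was given.
- This lets the LLM handle content it was never trained on and reduces hallucination, i.e., the generation of incorrect information.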

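## Comparing RAG and Fine-Tuning

### Fine-Tuning

- **Definition**: Further training a pre-trained LLM on data from a specific domain to turn it into a custom model specialized for that domain.
- **Advantages**
  - Optimized for the target domain, so it can achieve high accuracy and performance.
- **Disadvantages**
  - Retraining the model takes a lot of time and resources.
  - New information is not reflected automatically; incorporating it requires training again.

### RAG

- **Definition**: Without retraining the model, retrieve information from an external knowledge base and use it in the answer at query time.
- **Advantages**
  - New information is easy to incorporate.
  - Efficient, because the model itself does not need to be modified.
- **Disadvantages**
  - Answer accuracy depends on the quality of the retrieved documents.
  - A retrieval system has to be built.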

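## Summary

| Item | Fine-tuning | RAG |
| --- | --- | --- |
| Domain optimization | Possible | Limited |
| Reflecting new information | Requires retraining | Possible |
| Implementation difficulty | High | Moderate |
| Flexibility | Low | High |

- An LLM works only from the data it was trained on, so it cannot access specific knowledge such as recent information or internal company documents.
- Fine-tuning is expensive in time and cost and is hard to maintain.
- RAG, by contrast, compensates for this limitation with external documents without changing the existing LLM.
- RAG is especially useful in fields where information changes quickly (e.g., technical support, news, law). Fine-tuning, on the other hand, is effective when high accuracy is needed on static information.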


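## How RAG Works
- The process can be divided broadly into **storing information (indexing)**, **retrieval**, and **generation**.
  
### 1. Storing Information (Indexing)
RAG processes information in advance and stores it in a **vector database** (vector store) so that it can be searched later. This stage consists of the following steps.

1. **Load**
   - Load the data containing the background information to consult when answering.
2. **Split/Chunking**
   - Split long text into small pieces (*chunks*) of a fixed length.
   - This improves the accuracy of both retrieval and generation.
3. **Embedding**
   - Convert each text chunk into an **embedding vector**.
   - An embedding vector is a numeric representation of the chunk's meaning and serves as the index used to find documents similar to a question.
4. **Store**
   - Save the embedding vectors in a **vector database** (vector store).
   - A vector database is a data store specialized for quickly finding similar questions or sentences.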
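The four indexing steps can be sketched in plain Python. This is only an illustration under simplifying assumptions: the "embedding" is a toy bag-of-words count (a real system would call an embedding model), the "vector store" is an ordinary list, and the names `split_text`, `embed`, and `vector_store` are made up for this example.

```python
from collections import Counter

def split_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    step = chunk_size - overlap
    pieces = (text[i:i + chunk_size] for i in range(0, len(text), step))
    return [p for p in pieces if p.strip()]

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

# Load: here the "document" is just an in-memory string.
document = (
    "RAG retrieves documents related to a question. "
    "The retrieved documents are passed to the LLM with the question. "
    "Fine-tuning retrains the model on domain data."
)

# Split -> Embed -> Store: the "vector store" is a plain list of (vector, chunk) pairs.
chunks = split_text(document)
vector_store = [(embed(chunk), chunk) for chunk in chunks]

print(len(chunks), "chunks stored")
```

The overlap between neighboring chunks is a common choice so that a sentence cut at a chunk boundary still appears intact in at least one chunk.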
   
![rag](figures/rag1.png)

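### 2. Retrieval and Generation

When a user asks a question, the answer is generated as follows.
1. **Retrieve**
   - Embed the user's question, then search the vector database for context vectors similar to the question vector.
2. **Query (prompt construction)**
   - Combine the document chunks retrieved from the vector database with the user's question into a **prompt** and pass it to the LLM.
3. **Generation**
   - The LLM generates a response to the prompt it received.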
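A minimal sketch of the retrieve and prompt-construction steps, assuming a toy bag-of-words embedding and cosine similarity as the similarity measure. The chunks, the prompt wording, and the helper names (`embed`, `retrieve`, `build_prompt`) are hypothetical; the final call to an LLM is left out.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().replace("?", " ").replace(".", " ").split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# A tiny pre-built "vector store": (embedding, chunk) pairs.
chunks = [
    "The Olympic Games are held every four years.",
    "RAG passes retrieved documents to the LLM together with the question.",
    "PDF files can be loaded one page at a time.",
]
vector_store = [(embed(c), c) for c in chunks]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(vector_store, key=lambda item: cosine_similarity(q_vec, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def build_prompt(question: str, contexts: list[str]) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\n\nContext:\n{context_block}\n\nQuestion: {question}"

question = "How often are the Olympic Games held?"
prompt = build_prompt(question, retrieve(question))
print(prompt)  # this string is what would be sent to the LLM
```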
   
- **RAG flow**
  
![Retrieve and Generation](figures/rag2.png)


# Document Loader
- The data to be supplied alongside a question to the LLM must first be read in (loaded) so that it can be stored.
- Data sources vary.
    - How data is loaded depends on where it is stored and in what format.
      - Documents stored on a local computer
        - e.g., CSV, Excel, JSON, TXT files
      - Datasets stored in a database
      - Data on the internet
        - e.g., public web APIs, data on web pages, files in cloud storage

![rag_load](figures/rag_load.png)

- LangChain provides a variety of **document loaders** for different document formats.
    - Reading from different sources normally means using different libraries, each in its own way.
    - LangChain wraps these different reading methods behind a single interface.
        - https://python.langchain.com/docs/how_to/#document-loaders
    - Integrations with many third-party libraries (for PPT, GitHub, and so on) make it possible to load data from a wide range of sources.
        - https://python.langchain.com/docs/integrations/document_loaders/
- **Every document loader can be called through the same basic interface.**
- **Return type**
    - **list[Document]**
    - Loaded content is placed in `Document` objects. Because a loader may read multiple documents, they are returned together in a list.
        - **Document attributes**
            - page_content: the text of the document
            - metadata (optional): metadata about the document, stored as a dict.
            - id (optional): a unique ID for the document

- **Note**
    - When implementing RAG with LangChain, **you are not required to use LangChain's document loaders.**
    - A document loader is only a helper library for reading data. Reading the data with another library is perfectly fine.
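As an illustration of reading data without a document loader, the sketch below parses CSV text with the standard-library `csv` module and wraps each row in a `Document`. The CSV content, the `"inline.csv"` source label, and the metadata keys are made up for this example; the fallback dataclass is a hypothetical stand-in so the snippet also runs where LangChain is not installed.

```python
import csv
import io

try:
    from langchain_core.documents import Document
except ImportError:
    # Hypothetical stand-in so the sketch runs without LangChain installed.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

# Made-up CSV content; in practice this would come from a real file.
raw_csv = (
    "title,body\n"
    "Intro,RAG combines retrieval and generation.\n"
    "Loader,Loaders return Document objects.\n"
)

# One Document per row: the text goes in page_content, everything else in metadata.
docs = [
    Document(
        page_content=row["body"],
        metadata={"source": "inline.csv", "title": row["title"], "row": i},
    )
    for i, row in enumerate(csv.DictReader(io.StringIO(raw_csv)))
]

print(len(docs), docs[0].metadata)
```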

## Main Document Loaders

### Text file
- Uses `TextLoader`

In [1]:
from langchain_community.document_loaders import TextLoader

path = "data/olympic.txt"

# with open(path, 'rt') as f:
#     doc = f.read()

# 1. Create the loader -> provide information (the path) about the resource to read.
loader = TextLoader(path, encoding="utf-8")

# 2. Load the document.
docs = loader.load()  # lazy_load() instead reads documents lazily, at the point they are used.

print(type(docs), len(docs))
print(type(docs[0]))

<class 'list'> 1
<class 'langchain_core.documents.base.Document'>
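The difference between `load()` and `lazy_load()` can be illustrated with a small stand-alone class. `LineLoader` below is hypothetical and is not LangChain's implementation; it only mirrors the general pattern in which the lazy method is a generator and the eager method collects that generator into a list.

```python
import os
import tempfile
from typing import Iterator

class LineLoader:
    """Hypothetical loader: yields one record (a dict) per non-empty line of a text file."""

    def __init__(self, path: str, encoding: str = "utf-8"):
        self.path = path
        self.encoding = encoding

    def lazy_load(self) -> Iterator[dict]:
        # Generator: each line is read only when the caller asks for the next item.
        with open(self.path, encoding=self.encoding) as f:
            for line_no, line in enumerate(f, start=1):
                if line.strip():
                    yield {
                        "page_content": line.strip(),
                        "metadata": {"source": self.path, "line": line_no},
                    }

    def load(self) -> list[dict]:
        # Eager version: materialize everything into a list at once.
        return list(self.lazy_load())

# Write a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line\n\nsecond line\n")
    tmp_path = tmp.name

loader = LineLoader(tmp_path)
gen = loader.lazy_load()
first = next(gen)            # reads only as far as the first record
gen.close()                  # closes the file held open by the generator
all_records = loader.load()  # reads the whole file
os.remove(tmp_path)

print(first["page_content"], len(all_records))
```

Lazy loading is mainly useful when the source is large and the documents are processed one at a time, since the full list never has to sit in memory.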


In [2]:
# Document object attributes
doc = docs[0]
print("Metadata:", doc.metadata)
print("Document ID:", doc.id)
print("Content:")
print(doc.page_content[:100])

Metadata: {'source': 'data/olympic.txt'}
Document ID: None
Content:
올림픽
올림픽(영어: Olympic Games, 프랑스어: Jeux olympiques)은 전 세계 각 대륙 각국에서 모인 수천 명의 선수가 참가해 여름과 겨울에 스포츠 경기를 하


In [3]:
from langchain_core.documents import Document

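# Read the file directly, using the same UTF-8 encoding as the TextLoader call above.
with open(path, 'rt', encoding="utf-8") as f:
    load_doc = f.read()

# Build a Document by hand, attaching custom metadata.
d = Document(page_content=load_doc, metadata={"category": "Olympics", "path": path})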


### PDF
- PDF documents are loaded through third-party Python libraries for reading PDFs, such as pypdf and PyMuPDF.
    - https://python.langchain.com/docs/integrations/document_loaders/#pdfs
- Characteristics of each PDF loader
    - PyMuPDFLoader
        - Performs well at extracting not only text but also images, annotations, and similar information.
        - Based on the PyMuPDF library
    - PyPDFLoader
        - Extracts text quickly.
        - Based on the pypdf library (the successor to PyPDF2). A lightweight library that is fast and handles large files efficiently.
    - PDFPlumberLoader
        - Strong at handling data with complex structure such as tables. It can extract text, images, and tables.
        - Based on the pdfplumber library
- Packages to install
    - The libraries that each document loader wraps must be installed.
    - `pip install pypdf -qU`
    - `pip install pymupdf -qU`
    - `pip install pdfplumber -qU`

In [4]:
from langchain_community.document_loaders import PyPDFLoader

# 1. Create the loader -> connect it to the raw data.
path = "data/novel/금_따는_콩밭_김유정.pdf"

loader = PyPDFLoader(path)

docs = loader.load()  # list[Document]
len(docs)  # one Document per page

23

In [5]:
print(docs[1].page_content)

2 
위키백과
위키백과에  이  글
과  관련된 
자료가  있습니다 .
금  따는  콩밭
🙝 🙟 
땅속  저  밑은  늘  음침하
다 .
고달픈  간드렛불 , 맥없이
푸르끼하다 .
밤과  달라서  낮엔  되우  흐릿하였다 .
겉으로  황토  장벽으로  앞뒤좌우가  콕  막힌  좁직한  구뎅이 .
흡사히  무덤  속같이  귀중중하다 . 싸늘한  침묵 , 쿠더브레한
흙내와  징그러운  냉기만이  그  속에  자욱하다 .
곡괭이는  뻔질  흙을  이르집는다 . 암팡스러이  내려쪼며 ,
퍽  퍽  퍼억 .
이렇게  메떨어진  소리뿐 . 그러나  간간  우수수  하고  벽이  헐
린다 .
영식이는  일손을  놓고  소맷자락을  끌어당기어  얼굴의  땀을
훑는다 . 이놈의  줄이  언제나  잡힐는지  기가  찼다 . 흙  한줌을
집어  코밑에  바짝  들여대고  손가락으로  샅샅이  뒤져본다 . 완
연히  버력은  좀  변한  듯싶다 . 그러나  불통버력이  아주  다  풀
린  것도  아니었다 . 밀똥버력이라야  금이  온다는데  왜  이리
안  나오는지 .
곡괭이를  다시  집어든다 . 땅에  무릎을  꿇고  궁뎅이를  번쩍
든  채  식식거린다 . 곡괭이는  무작정  내려찍는다 . 바닥에서


In [6]:
docs[1].metadata

{'producer': 'Wikisource',
 'creator': 'Wikisource',
 'creationdate': '2024-11-24T07:05:35+00:00',
 'author': 'Unknown',
 'moddate': '2024-11-24T07:05:37+00:00',
 'title': '금 따는 콩밭',
 'source': 'data/novel/금_따는_콩밭_김유정.pdf',
 'total_pages': 23,
 'page': 1,
 'page_label': '2'}
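The `PyPDFLoader` text above contains doubled spaces and a stray space before punctuation. One possible cleanup, sketched here as an assumption rather than a step the loader performs, is a small regex-based normalizer. The function name `normalize_pdf_text` is made up, and the sample string is copied from the page output above.

```python
import re

def normalize_pdf_text(text: str) -> str:
    """Collapse repeated spaces and drop the space before punctuation in extracted PDF text."""
    text = re.sub(r"[ \t]+", " ", text)       # runs of spaces/tabs -> one space
    text = re.sub(r" ([.,!?])", r"\1", text)  # "있습니다 ." -> "있습니다."
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(lines)

sample = "위키백과에  이  글\n과  관련된 \n자료가  있습니다 ."
print(normalize_pdf_text(sample))
```

Line breaks are left untouched on purpose: deciding which hard-wrapped lines belong to the same sentence needs a heuristic that depends on the document.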

In [8]:
%pip install pymupdf



In [9]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(path)

docs = loader.load()
len(docs)

23

In [10]:
print(docs[1].page_content)

2
위키백과
위키백과에 이 글
과 관련된
자료가 있습니다.
금 따는 콩밭
🙝🙟
땅속 저 밑은 늘 음침하
다.
고달픈 간드렛불, 맥없이
푸르끼하다.
밤과 달라서 낮엔 되우 흐릿하였다.
겉으로 황토 장벽으로 앞뒤좌우가 콕 막힌 좁직한 구뎅이.
흡사히 무덤 속같이 귀중중하다. 싸늘한 침묵, 쿠더브레한
흙내와 징그러운 냉기만이 그 속에 자욱하다.
곡괭이는 뻔질 흙을 이르집는다. 암팡스러이 내려쪼며,
퍽 퍽 퍼억.
이렇게 메떨어진 소리뿐. 그러나 간간 우수수 하고 벽이 헐
린다.
영식이는 일손을 놓고 소맷자락을 끌어당기어 얼굴의 땀을
훑는다. 이놈의 줄이 언제나 잡힐는지 기가 찼다. 흙 한줌을
집어 코밑에 바짝 들여대고 손가락으로 샅샅이 뒤져본다. 완
연히 버력은 좀 변한 듯싶다. 그러나 불통버력이 아주 다 풀
린 것도 아니었다. 밀똥버력이라야 금이 온다는데 왜 이리
안 나오는지.
곡괭이를 다시 집어든다. 땅에 무릎을 꿇고 궁뎅이를 번쩍
든 채 식식거린다. 곡괭이는 무작정 내려찍는다. 바닥에서


In [11]:
docs[0].metadata

{'producer': 'Wikisource',
 'creator': 'Wikisource',
 'creationdate': '2024-11-24T07:05:35+00:00',
 'source': 'data/novel/금_따는_콩밭_김유정.pdf',
 'file_path': 'data/novel/금_따는_콩밭_김유정.pdf',
 'total_pages': 23,
 'format': 'PDF 1.4',
 'title': '금 따는 콩밭',
 'author': 'Unknown',
 'subject': '',
 'keywords': '',
 'moddate': '2024-11-24T07:05:37+00:00',
 'trapped': '',
 'modDate': "D:20241124070537+00'00'",
 'creationDate': "D:20241124070535+00'00'",
 'page': 0}

### Web

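- Uses `WebBaseLoader`
  - Reads the web document at the given URL and loads it as a document. Web documents can be fetched without writing separate crawling code.
  - Internally it parses the web document with BeautifulSoup.
- https://python.langchain.com/docs/how_to/document_loader_web/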

In [12]:
%pip install bs4



In [13]:
from langchain_community.document_loaders import WebBaseLoader

url = [
    "https://m.sports.naver.com/wfootball/article/421/0008308548",
    "https://m.sports.naver.com/wfootball/article/450/0000131435"
]

my_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36"

loader = WebBaseLoader(
    web_path=url, # single page -> str, multiple pages -> list[str]
    header_template={
        "user-agent":my_user_agent
    }
)

docs = loader.load()
len(docs)

USER_AGENT environment variable not set, consider setting it to identify your requests.


2
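The warning in the output above asks for a `USER_AGENT` environment variable. A minimal sketch of setting it follows; the identifier string is a made-up example, and the assumption is that the variable has to be set before `langchain_community.document_loaders` is imported, since the warning appears when that import runs.

```python
import os

# Assumption: run this before importing WebBaseLoader, because the warning is
# raised when the web loader module is imported without USER_AGENT set.
os.environ["USER_AGENT"] = "my-rag-notebook/0.1"  # hypothetical identifier; choose your own

print(os.environ["USER_AGENT"])
```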

In [14]:
docs[0].metadata

{'source': 'https://m.sports.naver.com/wfootball/article/421/0008308548',
 'title': '[단독]쿠팡플레이, 스포츠패스 첫 가격 월 1만원…15일부터 시행',
 'language': 'ko'}

In [15]:
print(docs[0].page_content)

[단독]쿠팡플레이, 스포츠패스 첫 가격 월 1만원…15일부터 시행NAVER스포츠메뉴홈야구해외야구축구해외축구농구배구N골프일반e스포츠아웃도어NEW뉴스영상일정순위포토홈 바로가기NAVER스포츠마이팀팀 추가응원하는 팀을 구독해보세요!스포츠야구해외야구축구해외축구농구배구N골프일반e스포츠아웃도어콘텐츠오늘의 경기승부예측연재이슈톡대학스포츠랭킹기타고객센터공식 블로그메뉴 닫기본문 바로가기[단독]쿠팡플레이, 스포츠패스 첫 가격 월 1만원…15일부터 시행입력2025.06.12. 오후 2:00수정2025.06.12. 오후 3:39기사원문김정현 기자양새롬 기자공감좋아요0슬퍼요0화나요0팬이에요0후속기사 원해요0텍스트 음성 변환 서비스본문 듣기를 종료하였습니다.글자 크기 변경공유하기쿠팡 와우 회원, 월 총 요금  '1만 7890원'(쿠팡플레이 갈무리)/뉴스1(서울=뉴스1) 김정현 양새롬 기자 = 쿠팡플레이가 부가서비스인 '스포츠 패스'의 금액을 월 1만 원으로 확정했다.12일 업계에 따르면 쿠팡플레이는 오는 15일 해외 스포츠 등의 콘텐츠를 유료 부가 서비스로 제공하는 스포츠 패스의 요금을 월 1만 원으로 결정했다. 공식 가격은 1만 2000원이나, 출시 할인가로 추정된다.쿠팡 와우 멤버십 구독료인 월 7890원에 스포츠패스 금액을 더하면 월 이용금액은 1만 7890원이 된다. 할인가가 종료될 경우 월 구독료만 2만 원 수준이다.이번 패스를 통해 볼수 있는 스포츠 리그는 △FIFA대회(FIFA클럽월드컵)△유럽축구리그(프리미어리그 2025~2026 시즌, 라리가, 분데스리가, 분데스리가2, 리그1, EFL 챔피언십 EFL리그원, 에레디비시) △유럽축구 토너먼트(FA컵, 카라바오컵, 커뮤니티쉴드, 코파 델레이, 수페르코파데 에스파냐, DFB-포칼, DFL-슈퍼컵, 쿠프드프랑스, 트로페데 샹피옹, 버투트로피) △아시아축구(AFC아시안컵, AFC챔피언스리그 엘리트, AFC챔피언스리그2, 기타AFC주관 국제 대회) △세계축구(월드컵남미 예선, 클럽 친선경기, 해외 국가 친선경기) 등이다.축구 외에도 

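- Loading only part of a page
- Use BeautifulSoup's `SoupStrainer`.
    - `BeautifulSoup("<html document>", parse_only=strainer)`
        - Only the regions selected by the strainer are parsed and searched.
    - `SoupStrainer("tag_name")` -> searches only within the given tag.
    - `SoupStrainer(name="tag_name", attrs={attr_name: attr_value})` -> searches only within tags of that name whose attribute matches the given value.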
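The effect of `parse_only` can be seen on a small inline HTML string, without any network access. The HTML below is a made-up page; only the class name `_article_content` is borrowed from the notebook, and the built-in `html.parser` is used as the parser.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: only the div with class "_article_content" holds the article body.
html = """
<html><body>
  <div class="nav">Menu | Home | Sports</div>
  <div class="_article_content"><p>Body text of the article.</p></div>
  <div class="footer">Copyright notice</div>
</body></html>
"""

# Keep only elements whose class attribute matches; everything else is skipped while parsing.
only_article = SoupStrainer(attrs={"class": "_article_content"})
soup = BeautifulSoup(html, "html.parser", parse_only=only_article)

text = soup.get_text(strip=True)
print(text)
```

Because filtering happens during parsing, the navigation and footer text never enter the resulting tree, which is why they also disappear from `page_content` when the same strainer is passed through `bs_kwargs`.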


In [16]:
import bs4

loader = WebBaseLoader(
    web_path=url,
    # WebBaseLoader uses bs4; bs_kwargs holds the parameters passed on to BeautifulSoup.
    bs_kwargs={
        "parse_only":bs4.SoupStrainer(attrs={"class":"_article_content"})
    }
)

docs = loader.load()
len(docs)

2

In [17]:
print(docs[0].page_content)

쿠팡 와우 회원, 월 총 요금  '1만 7890원'(쿠팡플레이 갈무리)/뉴스1(서울=뉴스1) 김정현 양새롬 기자 = 쿠팡플레이가 부가서비스인 '스포츠 패스'의 금액을 월 1만 원으로 확정했다.12일 업계에 따르면 쿠팡플레이는 오는 15일 해외 스포츠 등의 콘텐츠를 유료 부가 서비스로 제공하는 스포츠 패스의 요금을 월 1만 원으로 결정했다. 공식 가격은 1만 2000원이나, 출시 할인가로 추정된다.쿠팡 와우 멤버십 구독료인 월 7890원에 스포츠패스 금액을 더하면 월 이용금액은 1만 7890원이 된다. 할인가가 종료될 경우 월 구독료만 2만 원 수준이다.이번 패스를 통해 볼수 있는 스포츠 리그는 △FIFA대회(FIFA클럽월드컵)△유럽축구리그(프리미어리그 2025~2026 시즌, 라리가, 분데스리가, 분데스리가2, 리그1, EFL 챔피언십 EFL리그원, 에레디비시) △유럽축구 토너먼트(FA컵, 카라바오컵, 커뮤니티쉴드, 코파 델레이, 수페르코파데 에스파냐, DFB-포칼, DFL-슈퍼컵, 쿠프드프랑스, 트로페데 샹피옹, 버투트로피) △아시아축구(AFC아시안컵, AFC챔피언스리그 엘리트, AFC챔피언스리그2, 기타AFC주관 국제 대회) △세계축구(월드컵남미 예선, 클럽 친선경기, 해외 국가 친선경기) 등이다.축구 외에도 △레이싱(F1, F1 아카데미, 나스카) △골프(LIV 골프) △농구(남자 농구 아시아컵, 여자 농구 아시아컵) △미식 축구(NFL) 등도 스포츠 패스를 별도 구독해야 시청 가능하다. 올 가을부터는 NBA 경기도 독점 제공할 예정이다.쿠팡플레이는 대한민국 축구 대표팀, 한국 프로 축구, 이벤트 매치(쿠팡플레이 시리즈)는 별도 패스 가입 없이 와우 회원들이 시청할 수 있도록 할 예정이다. (쿠팡플레이 홈페이지 갈무리) /뉴스1이같은 정보는 쿠팡플레이 공식 홈페이지를 통해 '스포츠 패스' 페이지가 노출되며 알려졌다.2만 원에 육박하는 가격 정보가 알려지자 국내 스포츠 커뮤니티에서는 "축구만 보는데 관심없는 중계도 묶어서 비싸게 판매하는 대신 선택 폭을 늘

### ArxivLoader
- https://github.com/lukasschwab/arxiv.py
- [arXiv](https://arxiv.org/) is a **free paper repository** operated by Cornell University in the United States. It hosts papers in **science and finance fields** such as physics, mathematics, computer science, biology, finance, and economics.
- `ArxivLoader` fetches papers on a chosen topic from arXiv and loads them.
- Papers are retrieved through the **arXiv API**.
  - https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html
- Installation
  - `pip install langchain-community -qU`
  - `pip install arxiv -qU`



In [19]:
%pip install arxiv langchain-community


In [20]:
import arxiv 

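# Define the search criteria.
search = arxiv.Search(
    query="RAG", # search term
    max_results=2, # maximum number of results
    sort_by=arxiv.SortCriterion.Relevance
)
# Sort criteria - Relevance: most relevant to the search term first
#               - LastUpdatedDate: by the date the paper was last updated
#               - SubmittedDate: by the date the paper was first submitted

# Run the search.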
client = arxiv.Client()
result = client.results(search)
print(type(result))

<class 'itertools.islice'>


In [21]:
doc1 = next(result)  # first result; to iterate over all results use: for paper in result:


In [22]:
print("Title:", doc1.title)
print("Authors:", doc1.authors)
# print("Summary:", doc1.summary)
print("PDF URL:", doc1.pdf_url)

Title: Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks
Authors: [arxiv.Result.Author('Yunfan Gao'), arxiv.Result.Author('Yun Xiong'), arxiv.Result.Author('Meng Wang'), arxiv.Result.Author('Haofen Wang')]
PDF URL: http://arxiv.org/pdf/2407.21059v1


In [23]:
# Download the PDFs.
import os
os.makedirs("papers", exist_ok=True)

client = arxiv.Client()
result = client.results(search)

# Files are named by index, starting at 10 (10.pdf, 11.pdf, ...).
for idx, paper in enumerate(result, start=10):
    paper.download_pdf("papers", f"{idx}.pdf")
# download_pdf(target directory, file name)

In [None]:
%pip install pymupdf

Note: you may need to restart the kernel to use updated packages.


In [24]:
from langchain_community.document_loaders import ArxivLoader

loader = ArxivLoader(
    query="Advanced RAG", 
    top_k_results=1, # number of results to fetch
)

docs = loader.load()
len(docs)

1

In [25]:
docs[0].metadata

{'Published': '2024-07-26',
 'Title': 'Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks',
 'Authors': 'Yunfan Gao, Yun Xiong, Meng Wang, Haofen Wang',
 'Summary': 'Retrieval-augmented Generation (RAG) has markedly enhanced the capabilities\nof Large Language Models (LLMs) in tackling knowledge-intensive tasks. The\nincreasing demands of application scenarios have driven the evolution of RAG,\nleading to the integration of advanced retrievers, LLMs and other complementary\ntechnologies, which in turn has amplified the intricacy of RAG systems.\nHowever, the rapid advancements are outpacing the foundational RAG paradigm,\nwith many methods struggling to be unified under the process of\n"retrieve-then-generate". In this context, this paper examines the limitations\nof the existing RAG paradigm and introduces the modular RAG framework. By\ndecomposing complex RAG systems into independent modules and specialized\noperators, it facilitates a highly reconfigurabl

In [26]:
print(docs[0].page_content)

1
Modular RAG: Transforming RAG Systems into
LEGO-like Reconfigurable Frameworks
Yunfan Gao, Yun Xiong, Meng Wang, Haofen Wang
Abstract—Retrieval-augmented
Generation
(RAG)
has
markedly enhanced the capabilities of Large Language Models
(LLMs) in tackling knowledge-intensive tasks. The increasing
demands of application scenarios have driven the evolution
of RAG, leading to the integration of advanced retrievers,
LLMs and other complementary technologies, which in turn
has amplified the intricacy of RAG systems. However, the rapid
advancements are outpacing the foundational RAG paradigm,
with many methods struggling to be unified under the process
of “retrieve-then-generate”. In this context, this paper examines
the limitations of the existing RAG paradigm and introduces
the modular RAG framework. By decomposing complex RAG
systems into independent modules and specialized operators, it
facilitates a highly reconfigurable framework. Modular RAG
transcends the traditional linear architect

In [27]:
# Fetch only the paper summaries.
summary_docs = loader.get_summaries_as_docs()
print(summary_docs)

[Document(metadata={'Entry ID': 'http://arxiv.org/abs/2407.21059v1', 'Published': datetime.date(2024, 7, 26), 'Title': 'Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks', 'Authors': 'Yunfan Gao, Yun Xiong, Meng Wang, Haofen Wang'}, page_content='Retrieval-augmented Generation (RAG) has markedly enhanced the capabilities\nof Large Language Models (LLMs) in tackling knowledge-intensive tasks. The\nincreasing demands of application scenarios have driven the evolution of RAG,\nleading to the integration of advanced retrievers, LLMs and other complementary\ntechnologies, which in turn has amplified the intricacy of RAG systems.\nHowever, the rapid advancements are outpacing the foundational RAG paradigm,\nwith many methods struggling to be unified under the process of\n"retrieve-then-generate". In this context, this paper examines the limitations\nof the existing RAG paradigm and introduces the modular RAG framework. By\ndecomposing complex RAG systems into ind

In [28]:
summary_docs[0].metadata

{'Entry ID': 'http://arxiv.org/abs/2407.21059v1',
 'Published': datetime.date(2024, 7, 26),
 'Title': 'Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks',
 'Authors': 'Yunfan Gao, Yun Xiong, Meng Wang, Haofen Wang'}

In [29]:
print(summary_docs[0].page_content)

Retrieval-augmented Generation (RAG) has markedly enhanced the capabilities
of Large Language Models (LLMs) in tackling knowledge-intensive tasks. The
increasing demands of application scenarios have driven the evolution of RAG,
leading to the integration of advanced retrievers, LLMs and other complementary
technologies, which in turn has amplified the intricacy of RAG systems.
However, the rapid advancements are outpacing the foundational RAG paradigm,
with many methods struggling to be unified under the process of
"retrieve-then-generate". In this context, this paper examines the limitations
of the existing RAG paradigm and introduces the modular RAG framework. By
decomposing complex RAG systems into independent modules and specialized
operators, it facilitates a highly reconfigurable framework. Modular RAG
transcends the traditional linear architecture, embracing a more advanced
design that integrates routing, scheduling, and fusion mechanisms. Drawing on
extensive research, this pa

### Docling
- An open-source document-processing tool developed by IBM Research. It converts many kinds of documents into structured data for use in generative AI.
- **Key features**
  - Supports many formats, including PDF, DOCX, PPTX, XLSX, HTML, and images
  - Analyzes a PDF's **page layout, reading order, table structure, code, and formulas** to read it accurately.
  - Supports OCR, so text can be extracted from scanned PDFs and images.
  - Exports what it reads in formats such as Markdown, HTML, and JSON.
- Installation: `pip install langchain-docling ipywidgets -qU`
- References
  - Docling repository: https://github.com/docling-project/docling
  - LangChain Docling integration: https://python.langchain.com/docs/integrations/document_loaders/docling/

In [1]:
%pip install langchain-docling ipywidgets -qU

In [2]:
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

from huggingface_hub import login
from dotenv import load_dotenv

load_dotenv()

True

In [3]:
import os
# Log in to the Hugging Face Hub.
login(os.getenv("HUGGINGFACE_API_KEY"))

In [4]:
path = "papers/1.pdf"  # document path: a local file path or a URL
path = "https://arxiv.org/pdf/2506.09669"  # overrides the line above; this URL is what gets loaded

loader = DoclingLoader(file_path=path, export_type=ExportType.MARKDOWN)
docs = loader.load()
len(docs)

Downloading detection model, please wait. This may take several minutes depending upon your network connection.
Downloading recognition model, please wait. This may take several minutes depending upon your network connection.
  "scores": score[score > threshold],


1

In [5]:
docs[0].metadata

{'source': 'https://arxiv.org/pdf/2506.09669'}

In [6]:
# print(docs[0].page_content)
from IPython.display import Markdown

Markdown(docs[0].page_content)

## Query-Level Uncertainty in Large Language Models

## Lihu Chen¹, Gaël Varoquaux²

¹ Imperial College London, UK

² Soda, Inria Saclay, France

lihu.chen@imperial.ac.uk gael.varoquaux@inria.fr

## Abstract

It is important for Large Language Models to be aware of the boundary of their knowledge, the mechanism of identifying known and unknown queries. This type of awareness can help models perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or adopting the abstention mechanism, which is beneficial to the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which aims to determine if the model is able to address a given query without generating any tokens. To this end, we introduce a novel and training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that our internal confidence can outperform several baselines. Furthermore, we showcase that our proposed method can be used for efficient RAG and model cascading, which is able to reduce inference costs while maintaining performance. The code is available at https://github.com/tigerchen52/query_level_uncertainty

## 1 Introduction

Large language Models (LLMs) have their knowledge boundaries (Li et al., 2024; Yin et al., 2024; Ren et al., 2025), which means that there are certain problems that they cannot provide accurate outputs. It is crucial for LLMs to be self-aware of their limitations, i.e., know what I know and know what I don't know (Kadavath et al., 2022; Amayuelas et al., 2024).

Possessing awareness of knowledge boundaries provides several advantages in developing efficient and trustworthy AI. First, if LLMs can identify known-unknown or simple-hard queries, they can smartly perform adaptive inference to balance the trade-offs between computational cost and output quality. For queries beyond their parametric knowledge, they can choose to find relevant external knowledge via RAG (Lewis et al., 2020) or tool calls (Schick et al., 2023). When faced with hard problems, LLMs can engage in slow (or deep) thinking to improve their outputs, which is also known as test-time scaling (Snell et al., 2024; Zhang et al., 2025). Alternatively, another solution is to defer a complex problem to a larger model via model cascading (Dohan et al., 2022; Gupta et al., 2024). This adaptive inference ensures that computational resources are allocated effectively, which reduces costs while maintaining performance. Second, estimating whether a query is answerable enhances the honesty and trustworthiness of LLMs. When LLMs identify uncertain queries, they can use the abstention strategy (Wen et al., 2024) to withhold responses, which is important in high-stakes domains like healthcare (Tomani et al., 2024).

Figure 1: Illustrating the difference between answer-level and query-level uncertainty. Query-level uncertainty estimating known or unknown queries (knowledge boundary) before generating answers, which is useful for adaptive inference, e.g., efficient RAG and fast-slow reasoning.

In this work, we propose a new concept, Query-Level Uncertainty, to estimate a model's knowledge with regard to a given query. The research question here is: Given a query, can we determine if the model is able to address it without generating any tokens? Most existing work focus on answer-level uncertainty, which measures the uncertainty associated with a specific answer, helping us assess the reliability of outputs (Shorinwa et al., 2024; Vashurin et al., 2025). The main distinction here is that we shift from post-generation uncertainty to pre-generation uncertainty, which aims to measure how certain an LLM can solve this query, as shown in Figure 1.

Prior studies propose learning a probe on internal states to predict uncertainties of queries (Gottesman and Geva, 2024; Kossen et al., 2024). Another branch of work attempts to teach LLMs to explicitly express 'I don't know' in their responses via fine-tuning methods (Amayuelas et al., 2024; Kapoor et al., 2024; Cohen et al., 2024; Zhang et al., 2024a). One potential issue of these studies is that they often require fine-tuning and training samples, which introduces additional overhead and may limit their generalizability. We aim to introduce a training-free approach to estimate query-level uncertainty, which is simple yet effective.

Our approach relies on self-evaluation across internal layers and tokens, which is called Internal Confidence. The proposed approach is based on a simple assumption: LLMs can self-evaluate their knowledge about a query by answering a yes-no question. Inspired by the uncertainty method P(True) (Kadavath et al., 2022), we can compute the probability P(Yes) to indicate the model's confidence. To fully use latent knowledge within LLMs, we compute this kind of P(Yes) at each layer and token position. Following that, we aggregate these signals to obtain the final confidence score. This aggregation is motivated by prior work showing that leveraging logical consistency across layers can improve outputs (Burns et al., 2022; Chuang et al., 2023; Xie et al., 2024). Specifically, we perform a weighted sum across layers and tokens, and the weights are derived from attenuated encoding (Chen et al., 2023), which can control the influence of adjacent units.

To validate the effectiveness of our proposed internal confidence, we conduct experiments on three datasets that cover factual QA and mathematical reasoning tasks. For comparison, we adapt the existing answer-level methods to compute the query-level uncertainty. Experimental results demonstrate that our proposed internal confidence can distinguish known and unknown queries better than various baselines. In terms of applications, we showcase that our proposed method can help efficient RAG and model cascading. On the one hand, internal confidence can guide users to assess the tradeoffs between cost and quality when invoking additional services. On the other hand, it brings a 'benefit region', where inference overhead can be reduced without compromising performance.

To conclude, we propose a simple yet effective, training-free method to estimate query-level uncertainty, which can determine if a model can address a given query without generating any tokens.

## 2 Related Work

## 2.1 Uncertainty Estimation

Existing methods mainly focus on estimating the uncertainty of LLM-generated responses, which aim to provide a score to indicate the reliability of a query-answer pair (Geng et al., 2024; Shorinwa et al., 2024; Vashurin et al., 2025). These approaches often rely on internal states (Chen et al., 2024a) or textual responses (Kuhn et al., 2023), and commonly use calibration techniques to mitigate issues such as overconfidence (Zhang et al., 2024b) and biases (Chen et al., 2024b). Notably, these methods assess post-generation reliability, i.e., they evaluate uncertainty about a particular answer. In contrast, there is limited research on quantifying how well a model can address a query prior to token generation. For example, Gottesman and Geva (2024) propose training a lightweight probe on internal representations to estimate the model's knowledge about specific entities. Similarly, Semantic Entropy Probes (Kossen et al., 2024) suggest that internal model states can implicitly encode semantic uncertainty, even before any output is generated. To the best of our knowledge, this work is the first to formally define query-level uncertainty and investigate it systematically.

## 2.2 Knowledge Boundary Detection

LLMs should faithfully assess their level of confidence in answering a query. This knowledge boundary awareness (Li et al., 2024; Yin et al., 2024; Wang et al., 2024) is essential to build reliable AI systems, particularly in high-stakes domains such as healthcare and law. A pioneering study by Kadavath et al. (2022) explores whether language models can be trained to predict when they 'know' the answer to a given query, introducing the concept of 'I Know' (IK) prediction. Based on this idea, subsequent work has proposed methods to help LLMs become explicitly aware of their knowledge limitations through fine-tuning strategies (Amayuelas et al., 2024; Kapoor et al., 2024). Cohen et al. (2024) further advances this line of research by introducing a special [IDK] (' I don't know ') token into the model's vocabulary, allowing the direct expression of uncertainty in its output. Similarly, RTuning (Zhang et al., 2024a) tunes LLMs to refrain from responding to questions beyond their parametric knowledge. While these abstention-based approaches show benefits in mitigating hallucinations (Wen et al., 2024), they often require additional fine-tuning, which introduces overhead and may limit generalizability across models and tasks. In this work, we propose a training-free method to identify the knowledge boundary of an LLM, which offers a more generalizable and efficient alternative to detect the knowledge boundary of LLMs.

## 3 Preliminary

## 3.1 Aleatoric and Epistemic Uncertainty

Uncertainty in machine learning is commonly categorized into two main types: aleatoric and epistemic uncertainty (Hora, 1996; Der Kiureghian and Ditlevsen, 2009; Hüllermeier and Waegeman, 2021). These distinctions are often overlooked in the context of LLM uncertainty estimation. Aleatoric uncertainty arises from inherent randomness in the data, such as ambiguous inputs or conflicting annotations. This type of uncertainty is irreducible, as it reflects intrinsic noise in the input data. In contrast, epistemic uncertainty stems from a lack of knowledge, often due to insufficient training data and limited model capacity. Unlike aleatoric uncertainty, epistemic uncertainty is reducible with additional data or advanced modeling. In this work, we focus specifically on epistemic uncertainty, with the goal of evaluating whether an LLM possesses sufficient knowledge to answer a given query. Although it is possible that a dataset may contain some ambiguous queries and noisy labels, we assume that the benchmark datasets used in our experiments are well-curated, and have minimal ambiguity. This assumption allows us to reasonably minimize the impact of aleatoric uncertainty, and study the epistemic uncertainty in a clear way.

## 3.2 Uncertainty and Confidence

In the context of LLMs, the terms uncertainty and confidence are often treated as antonyms and used interchangeably. However, the two concepts have subtle differences. As noted by Lin et al. (2023), uncertainty is a holistic property of the entire predictive distribution, while confidence refers to the model's estimated confidence level associated with a specific answer. For example, given a query $x$ = 'What is the capital of France', estimating uncertainty requires the distribution over all possible answers, e.g., Paris, Toulouse, etc., as explained by the semantic entropy framework (Kuhn et al., 2023). In contrast, the conditional probability $P(Y = \text{Paris} \mid x)$ can serve as a confidence score that indicates the correctness of a specific answer. In the context of query-level uncertainty, we treat uncertainty and confidence as antonyms, as obtaining full probability distributions over all possible queries for a given model is infeasible.
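The distinction can be made concrete with a small sketch. The answer distribution below is hypothetical and chosen only for illustration, and plain Shannon entropy stands in for the more involved semantic entropy framework, which additionally groups answers by meaning.

```python
import math

def distribution_entropy(answer_probs):
    """Shannon entropy (in nats) of a distribution over candidate answers."""
    return -sum(p * math.log(p) for p in answer_probs.values() if p > 0)

def answer_confidence(answer_probs, answer):
    """Probability mass assigned to one specific answer."""
    return answer_probs.get(answer, 0.0)

# Hypothetical distribution for x = 'What is the capital of France'
answer_probs = {"Paris": 0.90, "Toulouse": 0.06, "Lyon": 0.04}

# Uncertainty is a property of the whole distribution ...
uncertainty = distribution_entropy(answer_probs)
# ... while confidence is attached to a single candidate answer.
conf_paris = answer_confidence(answer_probs, "Paris")

print(f"uncertainty = {uncertainty:.3f} nats, confidence(Paris) = {conf_paris:.2f}")
```

Computing the entropy needs every candidate answer and its probability, whereas the confidence needs only one entry, which is why the two quantities are not interchangeable in general.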

## 4 Problem Statement and Method

In this section, we describe our problem definition and introduce our method, Internal Confidence, a score that reflects whether an LLM can address a query using its own knowledge, prior to generating tokens.

## 4.1 Problem Statement

Given a query (including prompt words) $x = (x_1, \ldots, x_N)$, we aim to quantify the query-level uncertainty, $U(x)$, without generating an answer $y$. This is different from existing uncertainty methods that estimate the uncertainty associated with a specific generated answer, denoted as $U(x, y)$. We define that if an LLM can answer a query correctly in greedy decoding, the query falls within the knowledge boundary of the model, and its answer can be reliable. Otherwise, the query falls beyond the model's boundary, and it does not possess sufficient knowledge to answer it. We use this standard to evaluate the estimated query-level uncertainty, i.e., a lower uncertainty indicates a model is more likely to output the correct answer. Although different decoding strategies impact LLM outputs (Song et al., 2024), we aim to measure the internal knowledge of a model in a deterministic way.

Here, we focus on queries with definite answers, which have broad applications such as factual QA and mathematical reasoning. While contentious queries with open answers are also important in areas such as politics and philosophy, they are out of the scope of this work.

Figure 2: Left: the internal P(Yes) across tokens and layers. Middle: the AUC of P(Yes) across tokens and layers. Right: decay weights with different localities. Model: Llama-8B; Dataset: GSM8K validation set.

## 4.2 Method

Existing findings reveal that LLMs can express verbalized uncertainty in their responses (Tian et al., 2023; Xiong et al., 2024), which reflects that LLMs can evaluate the answer correctness in their own knowledge. Similarly, we can prompt an LLM to assess its confidence in answering a given query by using a yes-no format: "Respond only with 'Yes' or 'No' to indicate whether you are capable of answering the {Query} accurately. Answer Yes or No:". Following that, we can compute the probability P(Yes) at the last token ($x_N$):

$$P(\text{Yes}) = \mathrm{Softmax}\left(W_{\text{unemb}}\, h_N^{(L)}\right)_{\text{Yes}}$$

where $N$ is the index of the last token in the query, and $L$ is the index of the last layer of the model. $h_N^{(L)} \in \mathbb{R}^{d}$ is the hidden state and $d$ is the dimensionality of the hidden representations. $W_{\text{unemb}} \in \mathbb{R}^{|\mathcal{V}| \times d}$ is the unembedding matrix that maps the hidden state $h_N^{(L)}$ to logits over the vocabulary $\mathcal{V}$. P(Yes) can serve as a query-level confidence score here. It is somewhat correlated with verbalized uncertainty (Tian et al., 2023), but the main difference is that this method makes only a single forward pass of the query without generating any answer tokens.
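As a minimal sketch of this readout, the snippet below maps a hidden state to vocabulary logits with an unembedding matrix and takes the softmax entry of the 'Yes' token. The three-word vocabulary, the matrix values, and the function names are toy assumptions; in a real model the hidden state and unembedding matrix would come from the forward pass.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def p_yes(hidden, w_unemb, yes_id):
    """P(Yes) read out from one hidden state.

    hidden:  list of d floats (a hidden state such as h_N^(L))
    w_unemb: |V| rows of d floats (the unembedding matrix)
    yes_id:  vocabulary index of the 'Yes' token
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_unemb]
    return softmax(logits)[yes_id]

# Toy setup (assumed): vocabulary ["Yes", "No", "Maybe"], d = 2
W_UNEMB = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hidden_state = [2.0, 0.0]

print(f"P(Yes) = {p_yes(hidden_state, W_UNEMB, yes_id=0):.4f}")
```

The same function can be applied to a hidden state from any layer and token position, which is what the layer-wise analysis in the rest of this section relies on.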

However, P(Yes) does not fully use the internal states of LLMs, which preserve rich latent information for estimating uncertainty (Azaria and Mitchell, 2023; Chen et al., 2024a). Furthermore, prior work demonstrates that using logical consistency across layers can improve outputs (Burns et al., 2022; Chuang et al., 2023; Xie et al., 2024). Therefore, we propose the Internal Confidence, which leverages latent knowledge across different layers and tokens. Let $f_\theta$ denote the transformation function for computing hidden states, parameterized by $\theta$. The hidden state for the token $x_n$ of the query at layer $l$ is computed as:

$$h_n^{(l)} = f_\theta\left(h_1^{(l-1)}, \ldots, h_n^{(l-1)}\right)$$

In total, the model contains $N \times L$ such latent representations, and we can use Equation 4.2 to compute the P(Yes) for each $h_n^{(l)}$.

Figure 2a shows the average P(Yes) of Llama-8B on mathematical queries (the validation set of GSM8K (Cobbe et al., 2021)), across layers and query tokens¹. We observe that the probability increases gradually from lower to higher layers and from left to right positions, presenting diverse behaviors. If we treat each $P(\text{Yes} \mid h_n^{(l)})$ as a confidence score and evaluate the Area Under the Curve (AUC), we obtain an AUC heatmap that shows how well the model can distinguish known and unknown queries. As shown in Figure 2b, the top-right score is not optimal. In fact, the representation $h_5^{(27)}$ achieves the best AUC, and the performance gradually declines in regions surrounding this point. We refer to this optimal point as the Decision Center. It is important to note that the location of the Decision Center is sensitive to both model architecture and task type.
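The AUC heatmap and the search for its best cell can be sketched as follows. This is an illustration under assumptions: the grid values are invented, token positions are assumed to be aligned across queries, and `auc` is a plain pairwise implementation, not the evaluation code behind Figure 2b.

```python
def auc(scores, labels):
    """Chance that a known query (label 1) outscores an unknown one (label 0).

    Ties count as half a win (Mann-Whitney formulation of ROC AUC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one known and one unknown query")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def find_decision_center(p_yes_grid, labels):
    """Return (best_auc, layer, token) over all layer/token cells.

    p_yes_grid[q][l][n] is P(Yes) for query q at layer l and token n.
    """
    num_layers = len(p_yes_grid[0])
    num_tokens = len(p_yes_grid[0][0])
    best = None
    for layer in range(num_layers):
        for token in range(num_tokens):
            cell_auc = auc([q[layer][token] for q in p_yes_grid], labels)
            if best is None or cell_auc > best[0]:
                best = (cell_auc, layer, token)
    return best

# Invented toy data: 4 queries, 2 layers, 2 token positions
toy_grid = [
    [[0.5, 0.9], [0.4, 0.6]],  # known
    [[0.3, 0.8], [0.7, 0.5]],  # known
    [[0.6, 0.2], [0.5, 0.7]],  # unknown
    [[0.4, 0.1], [0.6, 0.4]],  # unknown
]
toy_labels = [1, 1, 0, 0]

print(find_decision_center(toy_grid, toy_labels))
```

In this toy grid the best cell is not the last layer and last token, mirroring the observation that the top-right score need not be optimal.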

¹ Here, we consider tokens after the {Query}, which means that a model has seen the entire query and is able to guess its knowledge gap.

Table 1: Overall performances of different query-level uncertainty methods.

| Method | TriviaQA AUC ↑ | TriviaQA PRR ↑ | TriviaQA ECE ↓ | SciQ AUC ↑ | SciQ PRR ↑ | SciQ ECE ↓ | GSM8K AUC ↑ | GSM8K PRR ↑ | GSM8K ECE ↓ | Avg AUC ↑ | Avg PRR ↑ | Avg ECE ↓ |
|--------------------------------------|------|-------|------|------|------|------|------|------|------|------|------|------|
| **Phi-3.8B**                         |      |       |      |      |      |      |      |      |      |      |      |      |
| Max(-log p)                          | 55.5 | 10.0  | -    | 51.4 | 2.9  | -    | 55.0 | 11.3 | -    | 54.0 | 8.1  | -    |
| Predictive Entropy                   | 58.9 | 17.9  | -    | 51.2 | 3.9  | -    | 63.6 | 25.7 | -    | 57.9 | 15.8 | -    |
| Min-K Entropy                        | 59.9 | 20.0  | -    | 52.7 | 4.9  | -    | 60.4 | 17.9 | -    | 57.7 | 14.3 | -    |
| Attentional Entropy                  | 60.6 | 21.4  | -    | 56.2 | 9.4  | -    | 52.4 | 4.4  | -    | 56.4 | 11.7 | -    |
| Perplexity                           | 61.8 | 24.3  | -    | 57.7 | 16.6 | -    | 53.6 | 6.9  | -    | 57.7 | 15.9 | -    |
| Internal Semantic Similarity         | 48.7 | -2.4  | 0.3  | 46.9 | -5.9 | 12.2 | 47.9 | -2.6 | 35.2 | 47.8 | -3.6 | 15.9 |
| P(Yes)                               | 58.1 | 16.4  | 13.9 | 58.8 | 16.9 | 10.8 | 56.6 | 12.0 | 7.6  | 57.8 | 15.1 | 10.8 |
| Internal Confidence (w/ naive avg)   | 58.8 | 17.3  | 19.9 | 52.4 | 4.5  | 3.3  | 54.7 | 14.7 | 21.7 | 55.3 | 12.2 | 15.0 |
| Internal Confidence                  | 56.2 | 13.1  | 13.9 | 57.2 | 15.2 | 8.2  | 57.2 | 12.9 | 6.0  | 56.9 | 13.7 | 9.4  |
| **Llama-8B**                         |      |       |      |      |      |      |      |      |      |      |      |      |
| Max(-log p)                          | 54.9 | 11.1  | -    | 51.4 | 1.9  | -    | 53.3 | 10.4 | -    | 53.2 | 7.8  | -    |
| Predictive Entropy                   | 58.5 | 17.7  | -    | 51.4 | 3.2  | -    | 66.1 | 28.0 | -    | 58.7 | 16.3 | -    |
| Min-K Entropy                        | 58.1 | 17.4  | -    | 53.5 | 7.9  | -    | 57.5 | 13.2 | -    | 56.4 | 12.8 | -    |
| Attentional Entropy                  | 59.4 | 18.7  | -    | 57.7 | 15.2 | -    | 56.1 | 13.5 | -    | 57.7 | 15.8 | -    |
| Perplexity                           | 58.6 | 17.1  | -    | 58.3 | 15.1 | -    | 53.2 | 4.3  | -    | 56.7 | 12.2 | -    |
| Internal Semantic Similarity         | 44.1 | -14.4 | 24.4 | 46.1 | -7.1 | 30.8 | 52.7 | 6.7  | 45.9 | 47.6 | -4.9 | 33.7 |
| P(Yes)                               | 66.4 | 33.0  | 27.5 | 51.3 | 2.4  | 23.7 | 62.2 | 24.8 | 11.6 | 60.0 | 20.1 | 20.9 |
| Internal Confidence (w/ naive avg)   | 67.2 | 34.4  | 14.9 | 58.6 | 15.4 | 21.5 | 59.1 | 18.7 | 29.2 | 61.6 | 22.8 | 21.9 |
| Internal Confidence                  | 67.8 | 34.5  | 19.1 | 56.4 | 13.0 | 18.9 | 62.9 | 27.9 | 1.3  | 62.4 | 25.1 | 13.1 |
| **Qwen-14B**                         |      |       |      |      |      |      |      |      |      |      |      |      |
| Max(-log p)                          | 56.5 | 12.4  | -    | 54.1 | 6.9  | -    | 54.3 | 13.5 | -    | 55.0 | 10.9 | -    |
| Predictive Entropy                   | 59.3 | 18.9  | -    | 53.2 | 6.9  | -    | 66.4 | 32.6 | -    | 59.6 | 19.5 | -    |
| Min-K Entropy                        | 59.9 | 20.0  | -    | 55.7 | 11.3 | -    | 63.0 | 30.9 | -    | 59.5 | 20.7 | -    |
| Attentional Entropy                  | 59.1 | 17.2  | -    | 59.4 | 19.2 | -    | 54.9 | 3.1  | -    | 57.8 | 13.2 | -    |
| Perplexity                           | 59.1 | 17.8  | -    | 60.1 | 20.7 | -    | 54.0 | 7.3  | -    | 57.7 | 15.3 | -    |
| Internal Semantic Similarity         | 51.0 | 2.5   | 2.0  | 45.5 | -7.7 | 14.9 | 47.5 | -4.6 | 33.1 | 48.0 | -3.3 | 16.7 |
| P(Yes)                               | 63.2 | 25.8  | 31.9 | 61.0 | 22.4 | 23.9 | 54.7 | 7.5  | 5.8  | 59.6 | 18.6 | 20.5 |
| Internal Confidence (w/ naive avg)   | 63.3 | 27.6  | 8.0  | 60.5 | 20.5 | 15.3 | 61.7 | 28.4 | 36.3 | 61.8 | 25.5 | 19.9 |
| Internal Confidence                  | 69.1 | 38.4  | 28.7 | 65.0 | 30.8 | 20.6 | 62.7 | 28.4 | 5.5  | 65.6 | 32.5 | 18.3 |

To improve the naive P(Yes), we can apply a weighted average centering around the decision center, which serves as an ensemble strategy to enhance calibration and expressivity (Zhang et al., 2020; Stickland and Murray, 2020). We refer to this process as Internal Confidence (IC), which can be denoted as:

$$\mathrm{IC}(x) = \sum_{n=1}^{N} \sum_{l=1}^{L} w_n^{(l)}\, P\left(\text{Yes} \mid h_n^{(l)}\right)$$

where $w_n^{(l)}$ is the weight for each $h_n^{(l)}$. The equation describes a two-step aggregation process. First, we compute a weighted sum across layers for each individual token. Then, we apply a second weighted average over these token-level aggregated scores. Ideally, this process requires a layer weight matrix $W_{\text{layer}} \in \mathbb{R}^{N \times L}$ for the first step and a token weight vector $W_{\text{token}} \in \mathbb{R}^{1 \times N}$ for the second step. Through this aggregation, we obtain a final confidence score.

In a practical implementation, the decision center is static and fixed to the last token and last layer. However, it is possible to use a hold-out set to identify optimal positions tailored to specific models and tasks. We make this simplification to remove the requirement for training samples and to obtain better generalizability. Additionally, the layer weight vectors are shared across tokens, which means we need only two weight vectors: $W_{\text{layer}} \in \mathbb{R}^{1 \times L}$ and $W_{\text{token}} \in \mathbb{R}^{1 \times N}$.

To reflect the observation that AUC performance gradually decays away from the decision center, we adopt the Attenuated Encoding (Chen et al., 2023) to compute the above two weight vectors:

<!-- formula-not-decoded -->

where $i$ is the index of the decision center, $d_{i,j}$ is the relative distance, and $w > 0$ is a scalar parameter that controls the locality value. Locality is a metric that measures how much the weights of a weight vector are gathered in adjacent positions. Given a weight vector for the $i$-th position $\epsilon_i = \{\epsilon_{i,1}, \epsilon_{i,2}, \ldots, \epsilon_{i,n}\}$, the locality can be denoted as:

<!-- formula-not-decoded -->

Figure 2c shows the weights computed by Equation 4 with varied localities. This signifies that we can control the influence of neighboring layers and tokens during the averaging process.
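The two-step aggregation with decaying weights can be sketched as below. The closed form of the Attenuated Encoding is not reproduced here: the sketch assumes a normalised exponential decay `exp(-w * distance)` purely for illustration, and the function names are hypothetical. The decision center is fixed to the last layer and last token, as in the practical setting described above.

```python
import math

def decay_weights(length, center, w=1.0):
    """Normalised weights that shrink with distance from `center`.

    ASSUMPTION: exponential decay in distance; the actual Attenuated
    Encoding of Chen et al. (2023) may use a different functional form.
    """
    raw = [math.exp(-w * abs(center - j)) for j in range(length)]
    total = sum(raw)
    return [r / total for r in raw]

def internal_confidence(p_yes, w=1.0):
    """Aggregate P(Yes) values of one query; p_yes[l][n] is layer l, token n."""
    num_layers, num_tokens = len(p_yes), len(p_yes[0])
    layer_w = decay_weights(num_layers, num_layers - 1, w)  # shared across tokens
    token_w = decay_weights(num_tokens, num_tokens - 1, w)
    # Step 1: weighted sum across layers for each token.
    per_token = [
        sum(layer_w[l] * p_yes[l][n] for l in range(num_layers))
        for n in range(num_tokens)
    ]
    # Step 2: weighted average over the token-level scores.
    return sum(token_w[n] * per_token[n] for n in range(num_tokens))

# Invented P(Yes) grid: 3 layers x 2 tokens
toy_p_yes = [[0.2, 0.3], [0.4, 0.6], [0.5, 0.9]]
print(f"IC = {internal_confidence(toy_p_yes, w=1.0):.4f}")
```

With a very large `w` almost all weight sits on the decision center and the score collapses to the plain last-layer, last-token P(Yes); smaller values let neighbouring layers and tokens contribute more.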

Figure 3: We use the Internal Confidence of Phi-3.8B to distinguish known and unknown queries.

Our proposed internal confidence is training-free and efficient, as it requires only a single forward pass of a given query. This matters because model responses are usually longer than input prompts, and invoking external services like RAG adds significant overhead. We hope this pre-generation uncertainty can support adaptive reasoning.
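One way such pre-generation gating might look is sketched below: a query is answered directly when its confidence score exceeds a threshold and is otherwise handed to a more expensive path such as RAG or a larger model. The function names, the threshold value, and the stand-in answer functions are hypothetical.

```python
from typing import Callable

def route_query(
    query: str,
    confidence: float,
    threshold: float,
    answer_directly: Callable[[str], str],
    answer_with_fallback: Callable[[str], str],
) -> str:
    """Answer from parametric knowledge if confident, else use the fallback."""
    if confidence > threshold:
        return answer_directly(query)
    return answer_with_fallback(query)

# Hypothetical stand-ins for a small model and a costlier fallback path.
def small_model(query: str) -> str:
    return f"[parametric] {query}"

def rag_pipeline(query: str) -> str:
    return f"[retrieved] {query}"

print(route_query("capital of France?", 0.82, 0.5, small_model, rag_pipeline))
print(route_query("obscure trivia?", 0.31, 0.5, small_model, rag_pipeline))
```

Raising the threshold sends more queries to the fallback path, which trades extra cost for fewer unsupported answers.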

## 5 Experiments

## 5.1 Settings

Implementations We provide one positive and one negative example to prompt LLMs, and the target model should follow the examples to output answers. All LLMs use greedy decoding to obtain deterministic results. The decision center is fixed to the last layer and last token, and we set $w = 1.0$ (Equation 4) for all models and datasets.

Models Three different sizes of LLMs are used in experiments: Phi-3-mini-4k-instruct (Abdin et al., 2024), Llama-3.1-8B-Instruct (Grattafiori et al., 2024), and Qwen2.5-14B-Instruct (Team, 2024). We aim to evaluate whether internal confidence scales to different model sizes. Note that internal confidence can be used for models without instruction tuning.

Datasets We evaluate on two factual QA datasets and one mathematical reasoning dataset: TriviaQA (Joshi et al., 2017), SciQ (Welbl et al., 2017), and GSM8K (Cobbe et al., 2021). The first two tasks aim to assess factual knowledge stored in parameters, while GSM8K requires models to self-evaluate their reasoning capabilities. The ground truth of the factual QA tasks is a short answer with some entity facts. GSM8K calls for a short answer, but the intermediate reasoning steps are evaluated as well, following prior work (Kadavath et al., 2022).

We ask a model to generate answers with greedy decoding. If the answer is aligned with the ground truth, we regard the model as having sufficient knowledge, and the query falls within its knowledge boundary. For the first two datasets with short answers, we consider an answer to be correct if its Rouge-L (Lin and Och, 2004) against the ground truth is greater than 0.3, which is consistent with prior work (Kuhn et al., 2023). For the GSM8K dataset, we use an LLM evaluator, Mistral-Large (MistralAI, 2024), to assess both the reasoning steps and the final answer. After that, we obtain a binary label for each query, which shows whether a model is able to address the query.
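The labelling rule for the short-answer datasets (an answer counts as correct when its Rouge-L against the ground truth exceeds 0.3) could be sketched as follows. The sketch assumes lowercased whitespace tokenisation and the F-measure variant of Rouge-L; the exact tokenisation and variant used in the experiments are not stated, so both are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """Rouge-L F-measure on lowercased, whitespace-split tokens (assumed)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def is_known(prediction, reference, threshold=0.3):
    """Binary label: does the greedy answer match the ground truth?"""
    return rouge_l_f1(prediction, reference) > threshold

print(is_known("The capital is Paris", "Paris"))
print(is_known("London", "Paris"))
```

The resulting booleans play the role of the per-query binary labels against which the uncertainty scores are evaluated.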

Baselines We adapt existing answer-level methods to quantify the pre-generation uncertainty, e.g., logit-based uncertainty. Given a query (including prompt words) $x = (x_1, \ldots, x_N)$, we can obtain a probability for each token $P(x_n \mid x_{<n})$ by performing a forward pass. (1) The baseline Max($-\log p$) measures the query's uncertainty by assessing the least likely token in the query (Manakul et al., 2023). (2) Predictive Entropy is defined as the entropy over the entire query tokens (Malinin and Gales, 2021):

<!-- formula-not-decoded -->

(3) Min-K Entropy combines the ideas of Max($-\log p$) and predictive entropy: it selects the top-K tokens from the query with the minimum token probability (Shi et al., 2024). (4) Attentional Entropy is an adapted version of the predictive entropy that performs a weighted sum:

<!-- formula-not-decoded -->

where $\alpha_n$ is the attentional weight for the token $x_n$. The intuition here is that tokens contribute to the semantic meaning in different ways, and we should not treat all tokens equally (Duan et al., 2024).

Figure 4: Left: We use estimated internal confidence scores to decide whether to invoke RAG. If the internal confidence exceeds a threshold, the model answers the query using its parametric knowledge. Otherwise, it relies on external knowledge for reasoning. The plot shows the accuracy of Phi-3.8B on the TriviaQA dataset under this setting. Right: We implement a model cascading setting with Phi-3.8B (small) and Llama-8B (large) on the TriviaQA dataset. The internal confidence of the smaller model determines whether it answers the query or defers to the larger model when confidence is low.

(5) Perplexity reflects how uncertain a model is when predicting the next token:

$$\mathrm{PPL}(x) = \exp\left(-\frac{1}{N} \sum_{n=1}^{N} \log P(x_n \mid x_{<n})\right)$$

(6) Internal Semantic Similarity measures the average similarity among hidden states of different layers $\{h_N^{(1)}, \ldots, h_N^{(L)}\}$, which is inspired by the lexical similarity (Fomicheva et al., 2020). (7) P(Yes) is the probability of self-evaluation, which is described in Equation 4.2. (8) Internal Confidence (w/ naive avg) is a variant of our proposed internal confidence. The distinction is that we apply a naive average to aggregate all scores.
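Three of the logit-based baselines can be sketched directly from per-token probabilities of the query. The probabilities below are invented, and the Min-K variant shown (mean negative log-probability over the `k` least likely tokens) is an assumed reading of the description above, not a confirmed implementation.

```python
import math

def max_neg_log_p(token_probs):
    """Max(-log p): negative log-probability of the least likely query token."""
    return max(-math.log(p) for p in token_probs)

def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the query tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

def min_k_neg_log_p(token_probs, k):
    """ASSUMED Min-K variant: mean -log p over the k least likely tokens."""
    lowest = sorted(token_probs)[:k]
    return -sum(math.log(p) for p in lowest) / len(lowest)

# Invented per-token probabilities P(x_n | x_<n) for a four-token query
token_probs = [0.5, 0.25, 0.5, 0.125]

print(f"Max(-log p) = {max_neg_log_p(token_probs):.4f}")
print(f"Perplexity  = {perplexity(token_probs):.4f}")
print(f"Min-K (k=2) = {min_k_neg_log_p(token_probs, 2):.4f}")
```

All three scores grow as the query becomes less predictable to the model, so they are read as uncertainties, with lower values suggesting a known query.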

Evaluation Metrics We evaluate uncertainty by assessing whether a method can distinguish known and unknown queries, which can be treated as a ranking problem, i.e., a lower uncertainty means a model is more likely to know the answer to the query. Following prior work (Manakul et al., 2023; Kuhn et al., 2023), we adopt the metrics Area Under the Curve (AUC) and Prediction Rejection Ratio (PRR) (Malinin et al., 2017) to measure this. Additionally, we use the Expected Calibration Error (ECE) to assess the calibration of different methods.

## 5.2 Internal Confidence Can Identify Known and Unknown Queries

Table 1 shows the overall performances of various query-level uncertainty methods. First, we can observe that our proposed internal confidence can distinguish known and unknown queries better than other baselines (based on AUC and PRR) on average, especially for larger models such as Llama-8B and Qwen-14B. For example, the average AUC of Qwen-14B is 65.6, which is significantly higher than other baselines. Regarding the calibration (ECE), internal confidence can achieve lower error across models and tasks consistently. These findings indicate the effectiveness of internal confidence. Second, the variant, Internal Confidence (w/ naive avg), leads to a decrease in general, which demonstrates the benefit of using the attenuated encoding to obtain decay weights.

Additionally, Figure 3 shows how well the internal confidence can distinguish known and unknown queries across three tasks. While the results confirm that our training-free method can predict knowledge boundaries to some extent, there is still considerable room for improvement. We hope this initial effort encourages further research in this direction.
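For reference, the calibration metric can be sketched with the common equal-width binning formulation of ECE. The bin count of 10 and the toy inputs are assumptions; the binning used for Table 1 is not stated.

```python
def expected_calibration_error(confidences, labels, num_bins=10):
    """Equal-width-bin ECE: weighted gap between confidence and accuracy."""
    bins = [[] for _ in range(num_bins)]
    for conf, label in zip(confidences, labels):
        index = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes to last bin
        bins[index].append((conf, label))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

# Invented confidence scores and binary known/unknown labels
toy_conf = [0.9, 0.9, 0.1, 0.1]
toy_known = [1, 1, 0, 0]
print(f"ECE = {expected_calibration_error(toy_conf, toy_known):.3f}")
```

A score is well calibrated under this metric when, among queries assigned a confidence near c, roughly a fraction c are actually answered correctly.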

## 5.3 Internal Confidence Makes LLM Reasoning More Efficient

Recent studies advance LLM reasoning by introducing additional resources, such as using RAG to obtain external knowledge (Lewis et al., 2020) and inference-time scaling to improve outputs (Snell et al., 2024). However, it is not always necessary to use additional resources, especially for simple queries. Here, we can use our proposed internal confidence to determine when to invoke RAG, slow thinking, or model cascading.

We conduct experiments for two scenarios: (1) Efficient RAG. The internal confidence can serve as a signal of the knowledge gaps of a model. If the score is greater than a threshold, the model is confident enough to address the query. Otherwise, it calls RAG. We use the TriviaQA dataset for evaluation. This dataset provides web search results for a query, which can be used as retrieved contexts for RAG. (2) Model Cascading. This task aims to achieve cost-performance trade-offs by coordinating small and large models (Dohan et al., 2022; Gupta et al., 2024). Smaller models are responsible for easy queries. If a smaller model is aware that a query is hard to answer, it invokes a larger model. We use a two-model cascade setting with Phi-3.8B and Llama-8B on the TriviaQA dataset. Likewise, if the internal confidence of the smaller model is high, we do not invoke the larger model. Otherwise, the hard query is deferred to the larger model.

Figure 4 shows the results of efficient RAG and model cascading. The trade-off region means that we can carefully select a threshold to control the call of external services, which helps strike a balance between efficiency and performance. The benefit region indicates scenarios where the use of additional resources can be reduced without compromising performance. Results across the two tasks further confirm the effectiveness of Internal Confidence in identifying knowledge gaps. Our method offers practical benefits by reducing inference overhead, which is correlated with computation time and monetary cost.

Figure 5: Impacts of locality on validation sets.

## 5.4 Locality Impacts Uncertainty Performance

We introduce attenuated encodings to aggregate probabilities centering around a decision point. The locality of the encoding may impact the performance of estimated uncertainties. To study the influence of the locality, we vary $w$ in Equation 4 to obtain encodings with different localities and observe how they impact the estimations. Figure 5 shows the AUC across different datasets and models. We can observe that the locality is correlated with task types and model architecture. For example, Phi-3.8B prefers an extreme locality (1.0) while Qwen-14B has an optimal value around 0.8. Regarding different datasets, the influence of locality values displays slightly different behaviors. Although we may need to search for an optimal locality for a specific task, we show that an empirical value ($w = 1.0$, Locality = 0.72) can achieve competitive performance across models and datasets.

## 6 Conclusion

In this work, we propose a new concept called query-level uncertainty, which aims to assess whether a model can address a query without generating any tokens. To this end, we propose an approach, internal confidence, which leverages latent self-evaluation to identify the boundary of a model's knowledge. Experimental results verify the effectiveness of our approach in factual QA and mathematical reasoning. Furthermore, we apply internal confidence to two practical scenarios of adaptive inference, efficient RAG and model cascading. Our findings reveal that our method can identify two regions: a trade-off region and a benefit region. The former means that users can strike a balance between cost and quality by carefully selecting a threshold of confidence scores. The latter means that users can reduce inference overhead without compromising performance. Although our method can serve as a strong baseline for estimating query-level uncertainty, there is still considerable room for improvement. We hope this study can stimulate future studies in this area.

## References

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, and 1 others. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 .

Alfonso Amayuelas, Kyle Wong, Liangming Pan, Wenhu Chen, and William Yang Wang. 2024. Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models. In Findings of the Association for Computational Linguistics ACL 2024 , pages 6416-6432.

Amos Azaria and Tom Mitchell. 2023. The internal state of an llm knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 967-976.

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2022. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations .

Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024a. Inside: Llms' internal states retain the power of hallucination detection. In ICLR .

Lihu Chen, Alexandre Perez-Lebel, Fabian Suchanek, and Gaël Varoquaux. 2024b. Reconfidencing llms from the grouping loss perspective. In Findings of the Association for Computational Linguistics: EMNLP 2024 , pages 1567-1581.

Lihu Chen, Gael Varoquaux, and Fabian Suchanek. 2023. The locality and symmetry of positional encodings. In Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 14313-14331.

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R Glass, and Pengcheng He. 2023. Dola: Decoding by contrasting layers improves factuality in large language models. In The Twelfth International Conference on Learning Representations .

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 .

Roi Cohen, Konstantin Dobler, Eden Biran, and Gerard de Melo. 2024. I don't know: Explicit modeling of uncertainty with an [idk] token. Advances in Neural Information Processing Systems , 37:10935-10958.

Armen Der Kiureghian and Ove Ditlevsen. 2009. Aleatory or epistemic? does it matter? Structural safety , 31(2):105-112.

David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, and 1 others. 2022. Language model cascades. arXiv preprint arXiv:2207.10342 .

Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. 2024. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 5050-5063.

Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. 2020. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics , 8:539-555.

Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. 2024. A survey of confidence estimation and calibration in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages 6577-6595.

Daniela Gottesman and Mor Geva. 2024. Estimating knowledge in large language models without generating a single token. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 3994-4019.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad AlDahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 .

Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, and Sanjiv Kumar. 2024. Language model cascades: Token-level uncertainty and beyond. In The Twelfth International Conference on Learning Representations .

Stephen C Hora. 1996. Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management. Reliability Engineering & System Safety , 54(2-3):217-223.

Eyke Hüllermeier and Willem Waegeman. 2021. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine learning , 110(3):457-506.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1601-1611.

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, and 1 others. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 .

Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine M Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. 2024. Large language models must be taught to know what they don't know. In The Thirty-eighth Annual Conference on Neural Information Processing Systems .

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. 2024. Semantic entropy probes: Robust and cheap hallucination detection in llms. arXiv preprint arXiv:2406.15927 .

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations .

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, and 1 others. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems , 33:9459-9474.

Moxin Li, Yong Zhao, Yang Deng, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, and Tat-Seng Chua. 2024. Knowledge boundary of large language models: A survey. arXiv preprint arXiv:2412.12472 .

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd annual meeting of the association for computational linguistics (ACL-04) , pages 605-612.

Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2023. Generating with confidence: Uncertainty quantification for black-box large language models. Transactions on Machine Learning Research .

Andrey Malinin and Mark Gales. 2021. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations .

Andrey Malinin, Anton Ragni, Kate Knill, and Mark Gales. 2017. Incorporating uncertainty into deep learning for spoken language assessment. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages 45-50.

Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 9004-9017.

MistralAI. 2024. Mistral large: A general-purpose language model. https://mistral.ai/news/mistral-large-2407/ .

Ruiyang Ren, Yuhao Wang, Yingqi Qu, Wayne Xin Zhao, Jing Liu, Hua Wu, Ji-Rong Wen, and Haifeng Wang. 2025. Investigating the factual knowledge boundary of large language models with retrieval augmentation. In Proceedings of the 31st International Conference on Computational Linguistics , pages 3697-3715.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems , 36:68539-68551.

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. 2024. Detecting pretraining data from large language models. In The Twelfth International Conference on Learning Representations .

Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z Ren, and Anirudha Majumdar. 2024. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. arXiv preprint arXiv:2412.05563 .

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 .

Yifan Song, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. 2024. The good, the bad, and the greedy: Evaluation of llms should not ignore non-determinism. arXiv preprint arXiv:2407.10457 .

Asa Cooper Stickland and Iain Murray. 2020. Diverse ensembles improve calibration. arXiv preprint arXiv:2007.04206 .

Qwen Team. 2024. Qwen2.5: A party of foundation models.

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 5433-5442.

Christian Tomani, Kamalika Chaudhuri, Ivan Evtimov, Daniel Cremers, and Mark Ibrahim. 2024. Uncertainty-based abstention in llms improves safety and reduces hallucinations. arXiv preprint arXiv:2404.10960 .

Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Lyudmila Rvanova, Daniil Vasilev, Akim Tsvigun, Sergey Petrakov, Rui Xing, Abdelrahman Sadallah, Kirill Grishchenkov, and 1 others. 2025. Benchmarking uncertainty quantification methods for large language models with lm-polygraph. Transactions of the Association for Computational Linguistics , 13:220-248.

Hongru Wang, Boyang Xue, Baohang Zhou, Tianhua Zhang, Cunxiang Wang, Huimin Wang, Guanhua Chen, and Kam-fai Wong. 2024. Self-dc: When to reason and when to act? self divide-and-conquer for compositional unknown questions. arXiv preprint arXiv:2402.13514 .

Johannes Welbl, Nelson F Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy Usergenerated Text , pages 94-106.

Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. 2024. Know your limits: A survey of abstention in large language models. arXiv preprint arXiv:2407.18418 .

Zhihui Xie, Jizhou Guo, Tong Yu, and Shuai Li. 2024. Calibrating reasoning in language models with internal consistency. In The Thirty-eighth Annual Conference on Neural Information Processing Systems .

Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. 2024. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In The Twelfth International Conference on Learning Representations .

Xunjian Yin, Xu Zhang, Jie Ruan, and Xiaojun Wan. 2024. Benchmarking knowledge boundary for large language models: A different perspective on model evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 2270-2286.

Hanning Zhang, Shizhe Diao, Yong Lin, Yi Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. 2024a. R-tuning: Instructing large language models to say 'i don't know'. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages 7106-7132.

Jize Zhang, Bhavya Kailkhura, and T Yong-Jin Han. 2020. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International conference on machine learning , pages 11117-11128. PMLR.

Mozhi Zhang, Mianqiu Huang, Rundong Shi, Linsen Guo, Chong Peng, Peng Yan, Yaqian Zhou, and Xipeng Qiu. 2024b. Calibrating the confidence of large language models by eliciting fidelity. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 29592979.

Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, and 1 others. 2025. A survey on test-time scaling in large language models: What, how, where, and how well? arXiv preprint arXiv:2503.24235 .

## A Example Appendix

This is an appendix.

### UnstructuredLoader
- Uses Unstructured, a library that reads many kinds of unstructured documents, to load documents in various formats so that they can be used for RAG or model fine-tuning.
  - Supported file types: "csv", "doc", "docx", "epub", "image", "md", "msg", "odt", "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx"
- It is useful when you need to **load text from files in many different formats**.
- You can install the library locally or use the API service provided by Unstructured.
  - https://docs.unstructured.io
- It can process a wide range of unstructured data files, including text files, PDF, images, HTML, XML, MS Office (Word, PowerPoint), and EPUB.
  - Installation and supported document types: https://docs.unstructured.io/open-source/installation/full-installation
  - LangChain documentation: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file

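> - How UnstructuredLoader splits a PDF into Documents
>     -  It splits the text based on the structure and content of the document and stores each piece in its own Document.
>     -  Split criteria
>        - Header: the document title, section titles, and similar elements
>        - Body text (NarrativeText): ordinary paragraphs and explanatory text
>        - Table: parts where the data is laid out as a table
>        - List: ordered or unordered lists
>        - Image: photos and other graphic elements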

#### Programs to install
- poppler
  - Required to convert PDF files to text.
  - Windows: download the latest release from https://github.com/oschwartz10612/poppler-windows/releases/ and extract it.
    - Add "(install path)\Library\bin" to the Path environment variable. (Restart the IDE after installation.)
  - macOS: `brew install poppler`
  - Linux: `sudo apt-get install poppler-utils`
- tesseract-ocr
  - An OCR library, required to convert images inside PDFs to text.
  - Windows: download and install it from https://github.com/UB-Mannheim/tesseract/wiki.
    - Add the install path ("C:\Program Files\Tesseract-OCR") to the Path environment variable. (Restart the IDE after installation.)
  - macOS: `brew install tesseract`
  - Linux (Ubuntu): `sudo apt install tesseract-ocr`
- Packages to install
  - **Install libmagic**
      - Windows: `pip install python-magic-bin -qU`
      - macOS: `brew install libmagic`
      - Linux (Ubuntu): `sudo apt-get install libmagic-dev`
  - `pip install "unstructured[pdf]" -qU`
      - Install the submodule for each document format you need (pdf, docx, ...).
      - Install all submodules: `pip install "unstructured[all-docs]"`
      - https://docs.unstructured.io/open-source/installation/full-installation
  - `pip install langchain-unstructured -qU`
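
Before loading PDFs, it can help to confirm that the external programs installed above are reachable from Python. The following is a minimal sketch using only the standard library. It assumes that poppler provides an executable named `pdftotext` and that tesseract provides one named `tesseract`; the helper name `check_tools` is hypothetical.

```python
import shutil


def check_tools(tools=("pdftotext", "tesseract")):
    """Return {tool name: True/False} depending on whether it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


status = check_tools()
for tool, found in status.items():
    print(f"{tool}: {'found' if found else 'NOT found on PATH'}")
```

If a tool is reported as missing right after installation, restarting the IDE or kernel (as noted above) is usually the first thing to try, since PATH changes are only picked up by new processes.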

In [7]:
%pip install langchain-unstructured

Note: you may need to restart the kernel to use updated packages.


In [None]:
%pip install unstructured
# Run `brew install qpdf` first, then pip install "unstructured[pdf]"


In [12]:
from langchain_unstructured import UnstructuredLoader
# path = "data/olympic.txt"
# path = "papers/1.pdf"
path = ["data/olympic.txt", "papers/1.pdf"]
loader = UnstructuredLoader(path)
docs = loader.load()

INFO: pikepdf C++ to Python logger bridge initialized


In [13]:
len(docs)

465

In [14]:
docs[300].metadata

{'source': 'papers/1.pdf',
 'coordinates': {'points': ((70.866, 338.5113216),
   (70.866, 403.26892159999994),
   (290.7855161390001, 403.26892159999994),
   (290.7855161390001, 338.5113216)),
  'system': 'PixelSpace',
  'layout_width': 595.276,
  'layout_height': 841.89},
 'file_directory': 'papers',
 'filename': '1.pdf',
 'languages': ['eng'],
 'last_modified': '2025-06-13T08:55:22',
 'page_number': 10,
 'parent_id': 'c0ff9d37ae73259855ed24d509a77b06',
 'filetype': 'application/pdf',
 'category': 'NarrativeText',
 'element_id': 'e1d2951b77c3895facdfdac2c0f93dec'}
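
Because every Document carries a `category` entry in its metadata (as in the output above), the loaded elements can be counted or filtered by type. The sketch below uses plain dictionaries with made-up content as stand-ins for Document objects, so it runs without any PDF; the sample data and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical stand-ins for loaded Documents: only the fields used below.
elements = [
    {"page_content": "Introduction", "metadata": {"category": "Header", "page_number": 1}},
    {"page_content": "RAG retrieves documents first.", "metadata": {"category": "NarrativeText", "page_number": 1}},
    {"page_content": "Model | Score", "metadata": {"category": "Table", "page_number": 2}},
    {"page_content": "The LLM then writes the answer.", "metadata": {"category": "NarrativeText", "page_number": 2}},
]

# How many elements of each category were extracted?
counts = Counter(e["metadata"]["category"] for e in elements)

# Keep only ordinary body text, e.g. to skip tables and images when indexing.
narrative = [e["page_content"] for e in elements if e["metadata"]["category"] == "NarrativeText"]

print(counts)
print(narrative)
```

With real loader output the same idea would read `doc.metadata["category"]` and `doc.page_content` instead of dictionary keys.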

In [15]:
docs[10].page_content

'각 올림픽 종목들은 IOC로부터 승인을 받은 국제경기연맹의 관리를 받는다. 35개의 연맹이 IOC에서 승인을 받았으며, 승인을 받았지만 현재 정식종목이 아닌 종목을 감독하는 연맹도 있다. IOC의 승인을 받았지만 올림픽 종목이 아닌 스포츠들은 올림픽 종목으로 고려되지는 않으나, 올림픽이 끝난 후 처음으로 열리는 IOC총회 때마다 정식종목이 되도록 신청을 할 수는 있다. IOC 총회 때 정식종목 선정은 총회에 참석중인 IOC위원들의 투표를 통해 이루어지며, 재적 위원 수의 과반수 이상 찬성표를 얻어야 정식종목으로 인정을 받는다. IOC의 승인을 받은 스포츠이나 찬성표를 받지 못해 정식종목이 되지 못한 스포츠로는 체스와 서핑과 같은 것이 있다.'

### Loading the document files in a directory
- Use DirectoryLoader

In [16]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    "data", # directory that contains the documents to read
    recursive=True, # whether to search subdirectories as well
)
docs = loader.load()
len(docs)



15

In [17]:
idx = 13
docs[idx].metadata
docs[idx].page_content

"1\n\n배따라기\n\nExported from Wikisource on 2024년 11월 24일\n\n2\n\n🙝🙟\n\n좋은 일기이다.\n\n좋은 일기라도, 하늘에 구름 한 점 없는 - 우리 ‘사람’으로서 는 감히 접근 못할 위엄을 가지고, 높이서 우리 조그만 ‘사 람’을 비웃는 듯이 내려다보는, 그런 교만한 하늘은 아니고, 가장 우리 ‘사람’의 이해자인 듯이 낮추 뭉글뭉글 엉기는 분 홍빛 구름으로서 우리와 서로 손목을 잡자는 그런 하늘이다. 사랑의 하늘이다.\n\n나는 잠시도 멎지 않고, 푸른 물을 황해로 부어 내리는 대동 강을 향한, 모란봉 기슭 새파랗게 돋아나는 풀 위에 뒹굴고 있었다.\n\n이날은 삼월 삼질, 대동강에 첫 뱃놀이하는 날이다. 까맣게 내려다보이는 물 위에는, 결결이 반짝이는 물결을 푸른 놀잇 배들이 타고 넘으며, 거기서는 봄 향기에 취한 형형색색의 선율이, 우단보다도 부드러운 봄 공기를 흔들면서 날아온다.\n\n그리고 거기서 기생들의 노래와 함께 날아오는 조선 아악 (雅樂)은 느리게, 길게, 유장하게, 부드럽게, 그리고 또 애처 롭게, 모든 봄의 정다움과 끝까지 조화하지 않고는 안두겠다 는 듯이 대동강에 흐르는 시꺼먼 봄 물, 청류벽에 돋아나는 푸르른 푸러음, 심지어 사람의 가슴속에 봄에 뛰노는 불붙는 핏줄기까지라도, 습기 많은 봄 공기를 다리 놓고 떨리지 않 고는 두지 않는다.\n\n봄이다. 봄이 왔다.\n\n3\n\n부드럽게 부는 조그만 바람이, 시꺼먼 조선 솔을 꿰며, 또는 돋아나는 풀을 스치고 지나갈 때의 그 음악은, 다른 데서는 듣지 못할 아름다운 음악이다.\n\n아아, 사람을 취케 하는 푸르른 봄의 아름다움이여! 열 다섯 살부터의 동경(東京) 생활에, 마음껏 이런 봄을 보지 못하였 던 나는, 늘 이것을 보는 사람보다 곱 이상의 감명을 여기서 받지 않을 수 없다.\n\n평양성 내에는, 겨우 툭툭 터진 땅을 헤치면 파릇파릇 돋아 나는 나무새기와 돋아나려는 버들의 어음으로 봄이 온 줄 알 뿐, 아직 완전히 봄이 안 이르렀지만, 

In [18]:
loader = DirectoryLoader(
    "data", # directory that contains the documents to read
    glob=["*.txt"],   # pattern(s) selecting which files to read
    recursive=False, # whether to search subdirectories as well
)
docs = loader.load()
len(docs)

3

In [19]:
len(docs)

3

In [20]:
docs[2].metadata

{'source': 'data/olympic.txt'}
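
The effect of `glob` and `recursive` can be reproduced with the standard library alone. This is only an illustration of which files such a pattern selects, not DirectoryLoader's implementation; the helper `list_files` and the temporary file names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_files(root: Path, pattern: str, recursive: bool) -> list[str]:
    """List files under root that match pattern, optionally descending into subdirectories."""
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p.relative_to(root).as_posix() for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "novel").mkdir()
    for name in ("a.txt", "b.pdf", "novel/c.txt"):
        (root / name).write_text("sample", encoding="utf-8")

    top_level_txt = list_files(root, "*.txt", recursive=False)
    all_txt = list_files(root, "*.txt", recursive=True)

print(top_level_txt)  # ['a.txt']
print(all_txt)        # ['a.txt', 'novel/c.txt']
```

This mirrors the two cells above: without a pattern and with `recursive=True` every file in the tree is picked up, while `*.txt` with `recursive=False` restricts the result to text files at the top level.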

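# Chunking (document splitting)

![rag_split](figures/rag_split.png)

- Splits the loaded documents into pieces (chunks) according to a specified criterion.

## Why split
1. **Context length limit of the embedding model**
    - Most language models limit the number of tokens they can process at once. Feeding in a whole document may exceed this limit, which makes it impossible to process.
2. **Better retrieval accuracy**
    - A small chunk that covers one topic matches a user's question more precisely than a whole large document. For example, when a question concerns a single feature in a 100-page manual, retrieving only the few paragraphs that describe that feature is more effective.
    - Not everything in a document is relevant to a given question. Chunking lets you use only the most relevant parts, which improves answer quality.
    - A whole document contains a lot of content unrelated to the question, which can confuse the model. Appropriately sized chunks reduce this noise.
3. **Computational efficiency**
    - Operations such as embedding generation and vector similarity computation are faster and more efficient on small chunks, and memory usage is lower as well.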

## Main splitters
- https://api.python.langchain.com/en/latest/text_splitters_api_reference.html

### CharacterTextSplitter
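The most basic text splitter.
- Splits on a single separator. (default: "\n\n")
    - If a split piece is shorter than chunk_size, it can be merged with the next piece.
        - Pieces are merged only if the merged result stays within chunk_size; otherwise they are left separate.
    - Because the split is driven by the separator, a chunk can contain more characters than chunk_size.
- chunk size: the target upper bound on the number of characters in each chunk.
    -  Splitting is always done on the separator. A piece produced by the separator is kept as-is even if it is longer than chunk_size. In other words, the **separator** takes priority over chunk_size.
- Main parameters
    - chunk_size: the maximum length of each chunk.
    - separator: the separator string. (default: '\n\n')
- CharacterTextSplitter is a simple splitter, and overlap is effectively not applied when it splits on a separator such as "\n\n". Overlap is applied only when separator is the empty string (""). Overlap means prepending the tail of the previous chunk to the next chunk in order to preserve context.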
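
The separator-first behaviour described above can be sketched in a few lines of plain Python. This is a simplified model written for illustration, not LangChain's implementation (it ignores overlap entirely), and the function name `split_by_separator` is hypothetical.

```python
def split_by_separator(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Split on one separator, then greedily merge neighbouring pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate          # still fits: keep merging
        else:
            if current:
                chunks.append(current)
            current = piece              # kept whole even if longer than chunk_size
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\n" + "c" * 30 + "\n\ndd"
chunks = split_by_separator(sample, chunk_size=12)
for c in chunks:
    print(len(c), repr(c))
```

The two short pieces are merged into one chunk, while the 30-character piece stays intact even though it exceeds `chunk_size=12`, which is the "separator takes priority" rule in action.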
  
### RecursiveCharacterTextSplitter
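- RecursiveCharacterTextSplitter is a **text splitter that is effective at breaking long text into pieces no longer than a specified maximum length (chunk_size)**.
- It **applies several separators in order**, splitting along paragraph, sentence, and word boundaries where possible, and ultimately satisfies the size limit.
- Default separators
    1. Two newline characters ("\n\n")
    2. One newline character ("\n")
    3. A space (" ")
    4. The empty string ("")
- How it works
    1. It first tries to split on the highest-priority separator ("\n\n").
    2. Any resulting piece that **exceeds chunk_size** is split again, recursively, on the next separators in priority order ("\n" → " " → "").
    3. This continues until no chunk exceeds chunk_size.
- Main parameters
    - chunk_size: the maximum length of each chunk.
    - chunk_overlap: the number of characters shared between consecutive chunks. When a new chunk is created, the specified number of characters from the end of the previous chunk is placed at the start of the new chunk, which preserves context across the chunk boundary.
      - When chunks are separated by a separator, the split is a natural one, so no overlap is applied.
      - When the text cannot be split on a natural separator and is cut to fit chunk_size, overlap is applied to keep the context connected.
    - separators (list): the separators to use. If given, they replace the default separators.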
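
The recursive procedure above can be sketched as follows. This is a simplified model for illustration only: it omits chunk_overlap and does not reproduce LangChain's exact chunk boundaries, and the function name `recursive_split` is hypothetical.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split text so that no chunk exceeds chunk_size, trying coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: no separator left, so cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in (p for p in text.split(sep) if p):
        if len(piece) > chunk_size:
            # Too long for this separator: flush what we have, retry with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "one two three\n\nfour five six seven eight\n\nnine"
chunks = recursive_split(sample, chunk_size=12)
print(chunks)
```

Each paragraph longer than 12 characters falls through to the space separator and is split between words, and a string without any separator at all ends up in the fixed-width cut of the final step.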

#### Methods
- `split_documents(Iterable[Document]) : List[Document]`
    - Takes a list of Documents and splits them.
- `split_text(str) : List[str]`
    - Takes a string and splits it.
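
One way to think about the relationship between the two methods is that `split_documents` applies `split_text` to each document's content and carries the document's metadata over to every resulting chunk. The sketch below models that idea with a toy splitter; the `Doc` class and both functions are hypothetical stand-ins, and the metadata-copying behaviour is stated here as an assumption about the library rather than taken from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical minimal stand-in for a Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_text(text: str, chunk_size: int) -> list[str]:
    """Toy splitter: cut every chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_documents(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Split each document's text and copy its metadata onto every chunk."""
    return [
        Doc(page_content=chunk, metadata=dict(doc.metadata))
        for doc in docs
        for chunk in split_text(doc.page_content, chunk_size)
    ]


source = [Doc("abcdefghij", {"source": "data/olympic.txt"})]
chunks = split_documents(source, chunk_size=4)
for c in chunks:
    print(c)
```

Keeping the metadata on each chunk is what later allows a retrieved chunk to be traced back to its source file.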

In [21]:
text = """가각간갇갈갉갊감갑값갓갔강갖갗같갚갛개객갠갤갬갭갯갰

aadlskfjadklsfjakldfjadklsjadfskl갸갹갼걀걋걍걔걘걜거걱건걷걸걺검겁것겉겊겋게겐

띱띳띵라락란랄람랍랏랐

랑랒랖랗래랙랜랠램랩랫랬랭랴략랸럇량러럭런럴럼럽럿렀렁렇레렉렌렐렘렙렛렝나낙낚ASDFFGHJJKKLLLQWE

멨멩며 

멱면멸몃몄명몇몌모목몫몬몰몲몸몹못몽뫄뫈뫘뫙뫼묀묄묍묏묑묘묜묠묩묫무묵묶문묻물묽묾뭄뭅뭇뭉뭍뭏ABCDEFGHIJ"""

In [32]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
# Rule of thumb: chunk_size must be larger than chunk_overlap
spliter = CharacterTextSplitter(
    chunk_size = 60,                                    # target maximum of 60 characters per chunk
    chunk_overlap = 10,                                 # number of overlapping characters
    # separator = ""                                      # default: "\n\n"
)
docs = spliter.split_text(text)
len(docs)

6

In [33]:
for doc in docs:
    print(len(doc))
    print(doc)

26
가각간갇갈갉갊감갑값갓갔강갖갗같갚갛개객갠갤갬갭갯갰
56
aadlskfjadklsfjakldfjadklsjadfskl갸갹갼걀걋걍걔걘걜거걱건걷걸걺검겁것겉겊겋게겐
11
띱띳띵라락란랄람랍랏랐
56
랑랒랖랗래랙랜랠램랩랫랬랭랴략랸럇량러럭런럴럼럽럿렀렁렇레렉렌렐렘렙렛렝나낙낚ASDFFGHJJKKLLLQWE
3
멨멩며
57
멱면멸몃몄명몇몌모목몫몬몰몲몸몹못몽뫄뫈뫘뫙뫼묀묄묍묏묑묘묜묠묩묫무묵묶문묻물묽묾뭄뭅뭇뭉뭍뭏ABCDEFGHIJ


In [35]:
from langchain_core.documents import Document
document = Document(page_content=text)
docs2 = spliter.split_documents([document])
type(docs2), len(docs2), type(docs2[0])

(list, 6, langchain_core.documents.base.Document)

In [36]:
for d in docs2:
    print(d)

page_content='가각간갇갈갉갊감갑값갓갔강갖갗같갚갛개객갠갤갬갭갯갰'
page_content='aadlskfjadklsfjakldfjadklsjadfskl갸갹갼걀걋걍걔걘걜거걱건걷걸걺검겁것겉겊겋게겐'
page_content='띱띳띵라락란랄람랍랏랐'
page_content='랑랒랖랗래랙랜랠램랩랫랬랭랴략랸럇량러럭런럴럼럽럿렀렁렇레렉렌렐렘렙렛렝나낙낚ASDFFGHJJKKLLLQWE'
page_content='멨멩며'
page_content='멱면멸몃몄명몇몌모목몫몬몰몲몸몹못몽뫄뫈뫘뫙뫼묀묄묍묏묑묘묜묠묩묫무묵묶문묻물묽묾뭄뭅뭇뭉뭍뭏ABCDEFGHIJ'


In [37]:
spliter2 = RecursiveCharacterTextSplitter(
    chunk_size = 50,
    chunk_overlap = 10
    # , separators=["first separator", "second separator", ...]
)

result = spliter2.split_text(text)
print(type(result), len(result))

<class 'list'> 9


In [38]:
for r in result:
    print(len(r), r, sep="||")
    print("="*70)

26||가각간갇갈갉갊감갑값갓갔강갖갗같갚갛개객갠갤갬갭갯갰
49||aadlskfjadklsfjakldfjadklsjadfskl갸갹갼걀걋걍걔걘걜거걱건걷걸걺검
17||걔걘걜거걱건걷걸걺검겁것겉겊겋게겐
11||띱띳띵라락란랄람랍랏랐
49||랑랒랖랗래랙랜랠램랩랫랬랭랴략랸럇량러럭런럴럼럽럿렀렁렇레렉렌렐렘렙렛렝나낙낚ASDFFGHJJK
17||ASDFFGHJJKKLLLQWE
3||멨멩며
49||멱면멸몃몄명몇몌모목몫몬몰몲몸몹못몽뫄뫈뫘뫙뫼묀묄묍묏묑묘묜묠묩묫무묵묶문묻물묽묾뭄뭅뭇뭉뭍뭏AB
18||묽묾뭄뭅뭇뭉뭍뭏ABCDEFGHIJ
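
In the output above, the pieces that had to be cut at the character level begin with the last 10 characters of the preceding piece. That overlap can be pictured as a sliding window whose step is `chunk_size - chunk_overlap`. The following is a minimal sketch of that idea only; it will not reproduce the exact boundaries printed above, and `sliding_window` is a hypothetical name.

```python
def sliding_window(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                      # the window already reached the end of the text
        start += step
    return chunks


windows = sliding_window("abcdefghij", chunk_size=4, chunk_overlap=2)
print(windows)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The guard at the top is the reason for the rule that chunk_size must be larger than chunk_overlap: otherwise the window would never move forward.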


In [42]:
result2 = spliter2.split_documents([document])
result2

[Document(metadata={}, page_content='가각간갇갈갉갊감갑값갓갔강갖갗같갚갛개객갠갤갬갭갯갰'),
 Document(metadata={}, page_content='aadlskfjadklsfjakldfjadklsjadfskl갸갹갼걀걋걍걔걘걜거걱건걷걸걺검'),
 Document(metadata={}, page_content='걔걘걜거걱건걷걸걺검겁것겉겊겋게겐'),
 Document(metadata={}, page_content='띱띳띵라락란랄람랍랏랐'),
 Document(metadata={}, page_content='랑랒랖랗래랙랜랠램랩랫랬랭랴략랸럇량러럭런럴럼럽럿렀렁렇레렉렌렐렘렙렛렝나낙낚ASDFFGHJJK'),
 Document(metadata={}, page_content='ASDFFGHJJKKLLLQWE'),
 Document(metadata={}, page_content='멨멩며'),
 Document(metadata={}, page_content='멱면멸몃몄명몇몌모목몫몬몰몲몸몹못몽뫄뫈뫘뫙뫼묀묄묍묏묑묘묜묠묩묫무묵묶문묻물묽묾뭄뭅뭇뭉뭍뭏AB'),
 Document(metadata={}, page_content='묽묾뭄뭅뭇뭉뭍뭏ABCDEFGHIJ')]

In [40]:
# Read olympic.txt and split it
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load the document
path = "data/olympic.txt"
loader = TextLoader(path, encoding="utf-8")
docs = loader.load()

# 2. Split the loaded document
spliter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_docs = spliter.split_documents(docs)
print(len(split_docs))

61


In [41]:
len_list = [len(d.page_content) for d in split_docs]                        # character count of each split document
len_list

[3,
 498,
 403,
 414,
 239,
 262,
 5,
 496,
 222,
 270,
 338,
 310,
 434,
 456,
 498,
 5,
 496,
 135,
 446,
 413,
 4,
 498,
 120,
 356,
 233,
 400,
 392,
 310,
 221,
 496,
 226,
 428,
 362,
 495,
 379,
 311,
 355,
 268,
 405,
 2,
 495,
 495,
 242,
 362,
 493,
 374,
 236,
 329,
 297,
 459,
 498,
 154,
 401,
 444,
 466,
 352,
 499,
 111,
 10,
 498,
 217]
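
A long list of lengths is easier to judge through a few summary numbers, for example to confirm that no chunk exceeds chunk_size and to spot very short chunks (often bare headings). The sketch below uses only the first eight values from the output above; the variable names and the threshold of 20 characters are arbitrary choices for illustration.

```python
from statistics import mean

# First eight chunk lengths from the output above; the full list works the same way.
sample_lengths = [3, 498, 403, 414, 239, 262, 5, 496]

longest = max(sample_lengths)
average = mean(sample_lengths)
tiny = [n for n in sample_lengths if n < 20]   # suspiciously short chunks

print(f"longest={longest}, average={average}, tiny chunks={tiny}")
```

Here the longest chunk (498) stays under the chunk_size of 500 used above, and two chunks are only a few characters long.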

In [43]:
idx = 1
print(split_docs[idx].page_content)

올림픽(영어: Olympic Games, 프랑스어: Jeux olympiques)은 전 세계 각 대륙 각국에서 모인 수천 명의 선수가 참가해 여름과 겨울에 스포츠 경기를 하는 국제적인 대회이다. 전 세계에서 가장 큰 지구촌 최대의 스포츠 축제인 올림픽은 세계에서 가장 인지도있는 국제 행사이다. 올림픽은 2년마다 하계 올림픽과 동계 올림픽이 번갈아 열리며, 국제 올림픽 위원회(IOC)가 감독하고 있다. 또한 오늘날의 올림픽은 기원전 8세기부터 서기 5세기에 이르기까지 고대 그리스 올림피아에서 열렸던 올림피아 제전에서 비롯되었다. 그리고 19세기 말에 피에르 드 쿠베르탱 남작이 고대 올림피아 제전에서 영감을 얻어, 근대 올림픽을 부활시켰다. 이를 위해 쿠베르탱 남작은 1894년에 IOC를 창설했으며, 2년 뒤인 1896년에 그리스 아테네에서 제 1회 올림픽이 열렸다. 이때부터 IOC는 올림픽 운동의 감독 기구가 되었으며, 조직과 활동은 올림픽 헌장을 따른다. 오늘날 전 세계 대부분의


In [44]:
from langchain_community.document_loaders import PyPDFLoader

# Load the document
path = "data/novel/메밀꽃_필_무렵_이효석.pdf"
loader = PyPDFLoader(path)
docs = loader.load()

# split
split_docs = spliter.split_documents(docs)
len(split_docs)

39

In [45]:
idx = 0
print(split_docs[idx].page_content)

1 
메밀꽃  필  무렵
Exported from Wikisource on 2024 년  11 월  24 일


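## Splitting by token count

- LLMs limit the number of input tokens, so a prompt that exceeds the limit cannot be sent in a request.
- Therefore, when splitting text into chunks, it is better to **specify the chunk size in tokens** rather than in characters.
- Because token-based splitting measures and cuts text in the units the model actually processes, it can reduce cases where a word is cut in the middle, as can happen with character-based splitting.
- When counting tokens, use the same tokenizer as the language model you are going to use.
  - For example, with OpenAI's GPT models, the tiktoken library gives accurate token counts.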
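
The only thing that changes between character-based and token-based chunking is how the size of a candidate chunk is measured. The sketch below makes that explicit by passing the length function as a parameter. The regex-based `toy_token_count` is a hypothetical stand-in for a real tokenizer such as tiktoken, so its counts will differ from real ones; `pack_sentences` is likewise a made-up helper.

```python
import re
from typing import Callable


def toy_token_count(text: str) -> int:
    """Stand-in tokenizer: counts words and punctuation marks as tokens."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def pack_sentences(sentences: list[str], max_size: int,
                   length_fn: Callable[[str], int]) -> list[str]:
    """Greedily merge sentences while length_fn(merged) stays within max_size."""
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if length_fn(candidate) <= max_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sentences = ["RAG retrieves documents.", "The LLM reads them.", "Then it answers."]
by_chars = pack_sentences(sentences, max_size=45, length_fn=len)               # size in characters
by_tokens = pack_sentences(sentences, max_size=8, length_fn=toy_token_count)   # size in tokens

print(len(by_chars), by_chars)
print(len(by_tokens), by_tokens)
```

The same text yields different chunk boundaries depending on the unit of measurement, which is why the tokenizer used for counting should match the model that will consume the chunks.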

### Splitting with the [tiktoken](https://github.com/openai/tiktoken) tokenizer
- A `BPE` tokenizer that OpenAI used when training its GPT models. **When you use an OpenAI language model, using it gives a more accurate token count.**
- Create the splitter with the `from_tiktoken_encoder()` method of a splitter class.
  - `RecursiveCharacterTextSplitter.from_tiktoken_encoder()`
  - `CharacterTextSplitter.from_tiktoken_encoder()`
- Parameters
  - encoding_name: the encoding (tokenization rules) to use. OpenAI used different encodings for different GPT models, so specify the one that matches the model you intend to use.
    - `cl100k_base`: used by the GPT-4 and GPT-3.5-Turbo models.
    - `r50k_base`: used by the GPT-3 models.
  - model_name: alternatively, pass a model name and the encoding for that model is used.
  - chunk_size, chunk_overlap, separators (same as above)
- Installing tiktoken
  - `pip install tiktoken`

### HuggingFace Tokenizer
- When using a HuggingFace model, split by token count with the tokenizer that the model uses.
  - Splitting with a different tokenizer can produce different token counts.
- Use the `from_huggingface_tokenizer()` method.
  - Parameters
    - tokenizer: a HuggingFace tokenizer object
    - chunk_size, chunk_overlap, separators (same as above)
- The `transformers` library must be installed.
  - `pip install transformers`

In [48]:
loader = TextLoader("data/olympic.txt", encoding="utf-8")

spliter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4o-mini",                                       # use the tokenizer associated with this model
    chunk_size=200,                                                 # measured in tokens
    chunk_overlap=0
)
docs = loader.load_and_split(spliter)
len(docs)

92

In [49]:
print(docs[1].page_content)

올림픽(영어: Olympic Games, 프랑스어: Jeux olympiques)은 전 세계 각 대륙 각국에서 모인 수천 명의 선수가 참가해 여름과 겨울에 스포츠 경기를 하는 국제적인 대회이다. 전 세계에서 가장 큰 지구촌 최대의 스포츠 축제인 올림픽은 세계에서 가장 인지도있는 국제 행사이다. 올림픽은 2년마다 하계 올림픽과 동계 올림픽이 번갈아 열리며, 국제 올림픽 위원회(IOC)가 감독하고 있다. 또한 오늘날의 올림픽은 기원전 8세기부터 서기 5세기에 이르기까지 고대 그리스 올림피아에서 열렸던 올림피아 제전에서 비롯되었다. 그리고 19세기 말에 피에르 드 쿠베르탱 남작이 고대 올림피아


In [50]:
# Huggingface Tokenizer
from transformers import AutoTokenizer
model_id = "beomi/kcbert-base"                      # ID of the model to use
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [52]:
spliter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=tokenizer,
    chunk_size=300,
    chunk_overlap=0
)
docs = loader.load_and_split(spliter)
len(docs)

47