# RAG
Retrieval-Augmented Generation

In [None]:
# A pretrained model has already learned from a large amount of data,
# but it cannot access private data such as a personal DB or internal company documents.
# That is why we use RAG!

# RAG techniques
- See the slides (ppt)

In [None]:
"""
How RAG should be implemented depends on factors such as:

- how many documents we have
- how much we can spend to run it (which model, how many tokens are available, etc.)

"""
None

# What is Retrieval?

https://python.langchain.com/v0.1/docs/modules/data_connection/

![](https://python.langchain.com/v0.1/assets/images/data_connection-95ff2033a8faa5f3ba41376c0f6dd32a.jpg)


In [None]:
# The typical flow of Retrieval, the first stage of RAG:
# - load data from a data source
# - transform the data by splitting it
# - embed the transformed data
# - save the embedded data in a store
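
The four steps above can be sketched end to end in plain Python. This is a toy illustration, not LangChain code: the function names (`load`, `split`, `embed`, `retrieve`), the in-memory `source` dictionary, and the bag-of-words "embedding" are all assumptions made for this example. A real pipeline would use a document loader, a text splitter, an embedding model, and a vector store.

```python
import math
from collections import Counter


def load(source: dict[str, str]) -> list[str]:
    """Load: return the raw texts held in an in-memory 'data source'."""
    return list(source.values())


def split(text: str, chunk_size: int = 40) -> list[str]:
    """Transform: cut the text into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(text: str) -> Counter:
    """Embed: toy bag-of-words vector (a real system would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample data, invented for this sketch.
source = {
    "doc1": "The Ministry of Peace concerns itself with war. "
            "The Ministry of Love maintains law and order."
}

# Store: keep each chunk next to its vector.
store = [(chunk, embed(chunk)) for text in load(source) for chunk in split(text)]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k stored chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


print(retrieve("Peace"))
```

Fixed-size character chunks are used here only to keep the sketch short; they can cut words in half, which is one reason the splitters discussed later in this notebook exist.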

# Data Loader

In [1]:
# LangChain provides a variety of document loaders:
# CSV, File Directory, HTML, JSON, Markdown, PDF, etc.
# ※ There are also third-party loaders.

In [None]:
"""
A Data Loader is code that extracts data from a source and brings it into LangChain.

A great many document loader sources are provided. (Take a look ↓)
https://python.langchain.com/docs/integrations/document_loaders/#all-document-loaders

GitHub, Figma, Facebook Chat, MS PowerPoint, Slack, Telegram, Trello, Twitter, etc.
All of them can be used from LangChain.

Although the Data Loaders are varied, they are designed with nearly identical API interfaces.
"""
None
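
To make the "nearly identical API" point concrete, here is a minimal sketch of two loaders that share one `load()` method. `SimpleDocument`, `SimpleTextLoader`, and `SimpleLineLoader` are hypothetical stand-ins written for this example; they are not LangChain classes, although they imitate the shape of a document with `page_content` and `metadata`.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical loader: reads one text file into a single document."""

    def __init__(self, file_path: str, encoding: str = "utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> list[SimpleDocument]:
        with open(self.file_path, encoding=self.encoding) as f:
            text = f.read()
        return [SimpleDocument(text, {"source": self.file_path})]


class SimpleLineLoader(SimpleTextLoader):
    """Hypothetical loader: same load() call, but one document per non-empty line."""

    def load(self) -> list[SimpleDocument]:
        whole = super().load()[0]
        return [
            SimpleDocument(line, {"source": self.file_path, "line": i})
            for i, line in enumerate(whole.page_content.splitlines())
            if line.strip()
        ]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Part One\n\nIt was a bright cold day in April.\n")

    # The calling code does not care which loader it is given.
    loaders = [SimpleTextLoader(path), SimpleLineLoader(path)]
    counts = [len(loader.load()) for loader in loaders]

print(counts)
```

Because every loader exposes the same method and returns the same kind of object, the rest of the pipeline (splitting, embedding, storing) can stay unchanged when the data source changes.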

## Preparing the files

In [None]:
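# Prepare the files as shown below.
# Google Drive users should create them in their own Google Drive space.

# The source is George Orwell's novel '1984', Part 1 Chapter 1
#  http://www.george-orwell.org/1984/0.html

# A file that is neither too long nor too short works best.
# If the file is too long, the embedding step later will cost more.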

In [2]:
import os
from langchain_openai.chat_models.base import ChatOpenAI
llm = ChatOpenAI(temperature=0.1)

In [3]:
base_path = r'D:\Lang2505\dataset\files'

## TextLoader

In [4]:
# v0.3
from langchain_community.document_loaders.text import TextLoader
# https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.text.TextLoader.html
# Load text file.


In [5]:
loader = TextLoader(os.path.join(base_path, 'chapter_one.txt'))

In [7]:
docs = loader.load()  # returns List[Document]
docs

[Document(metadata={'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.txt'}, page_content="Part 1, Chapter 1\n\nPart One\n\n\n1\nIt was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.\n\nThe hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It depicted simply an enormous face, more than a metre wide: the face of a man of about forty-five, with a heavy black moustache and ruggedly handsome features. Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours. It was part of the economy drive in preparation for Hate Week. T

In [8]:
len(docs)

1

## PyPDFLoader

In [9]:
# v0.3
from langchain_community.document_loaders.pdf import PyPDFLoader

# https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html
# PyPDFLoader document loader integration

In [10]:
loader = PyPDFLoader(os.path.join(base_path, 'chapter_one.pdf'))
docs = loader.load()
print(len(docs))
docs

15


[Document(metadata={'producer': 'Microsoft® Word 2016', 'creator': 'Microsoft® Word 2016', 'creationdate': '2025-01-30T23:19:00+09:00', 'author': 'Yeonchul Sung', 'moddate': '2025-01-30T23:19:00+09:00', 'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Part 1, Chapter 1 \n \n \nPart One \n \n \n1 \nIt was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his \nchin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through \nthe glass doors of Victory Mansions, though not quickly enough to prevent a swirl of \ngritty dust from entering along with him. \n \nThe hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured \nposter, too large for indoor display, had been tacked to the wall. It depicted simply an \nenormous face, more than a metre wide: the face of a man of about forty-five, with a \nheavy black moustache and ruggedly handsome

## UnstructuredFileLoader

In [11]:
"""
Rather than using a different loader for each file format,
you can also use UnstructuredFileLoader.

It can read many different kinds of files.
"""
None

In [12]:
# v0.3
from langchain_community.document_loaders.unstructured import UnstructuredFileLoader

# https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.unstructured.UnstructuredFileLoader.html
# Loads files of many types using the `unstructured` package.

In [13]:
loader = UnstructuredFileLoader(os.path.join(base_path, 'chapter_one.pdf'))
docs = loader.load()
print(len(docs))
docs

  loader = UnstructuredFileLoader(os.path.join(base_path, 'chapter_one.pdf'))
CropBox missing from /Page, defaulting to MediaBox


1


[Document(metadata={'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.pdf'}, page_content="Part 1, Chapter 1\n\nPart One\n\n1\n\nIt was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his\n\nchin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through\n\nthe glass doors of Victory Mansions, though not quickly enough to prevent a swirl of\n\ngritty dust from entering along with him.\n\nThe hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured\n\nposter, too large for indoor display, had been tacked to the wall. It depicted simply an\n\nenormous face, more than a metre wide: the face of a man of about forty-five, with a\n\nheavy black moustache and ruggedly handsome features. Winston made for the stairs. It\n\nwas no use trying the lift. Even at the best of times it was seldom working, and at\n\npresent the electric current was cut off during daylight hours. It was part of the economy\n\ndrive in p

In [14]:
loader = UnstructuredFileLoader(os.path.join(base_path, 'chapter_one.txt'))
docs = loader.load()
print(len(docs))
docs

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


1


[Document(metadata={'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.txt'}, page_content="Part 1, Chapter 1\n\nPart One\n\n1 It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.\n\nThe hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It depicted simply an enormous face, more than a metre wide: the face of a man of about forty-five, with a heavy black moustache and ruggedly handsome features. Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours. It was part of the economy drive in preparation for Hate Week. The 

In [15]:
loader = UnstructuredFileLoader(os.path.join(base_path, 'chapter_one.docx'))
docs = loader.load()
print(len(docs))
docs

1


[Document(metadata={'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.docx'}, page_content="Part 1, Chapter 1\n\nPart One\n\n\n1\nIt was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.\n\nThe hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It depicted simply an enormous face, more than a metre wide: the face of a man of about forty-five, with a heavy black moustache and ruggedly handsome features. Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours. It was part of the economy drive in preparation for Hate Week. 

# Splitter

## Why data needs to be split

In [None]:
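# To answer a specific question, we may need to pass only the relevant part of a file to the model.

# That is why documents should be split into pieces.

# For example, if we are looking for "Ministry of peace",
# we only need to hand the model the document(s) that contain that keyword.

# Splitting into small pieces makes it easier to find what we need.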
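
As a small illustration of that idea, the sketch below splits a text into paragraphs and keeps only the piece that mentions the keyword. The sample text is invented for this example, and plain substring matching is an assumption standing in for real retrieval.

```python
# Hypothetical sample text, invented for this sketch.
text = (
    "The Ministry of Truth concerned itself with news and education.\n\n"
    "The Ministry of Peace concerned itself with war.\n\n"
    "The Ministry of Love maintained law and order."
)

# Split into paragraph-sized pieces.
chunks = text.split("\n\n")

# Keep only the pieces that mention the keyword (case-insensitive).
keyword = "ministry of peace"
relevant = [chunk for chunk in chunks if keyword in chunk.lower()]

# Only this smaller context would be handed to the model.
context = "\n\n".join(relevant)
print(f"full text: {len(text)} characters, context: {len(context)} characters")
```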


## RecursiveCharacterTextSplitter

In [16]:
# v0.3
from langchain_text_splitters.character import RecursiveCharacterTextSplitter
# https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html

# Splitting text by recursively look at characters.
# Recursively tries to split by different characters to find one that works.
# Create a new TextSplitter.

In [19]:
splitter = RecursiveCharacterTextSplitter()
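# RecursiveCharacterTextSplitter splits the text by trying a list of separators in order
# (by default paragraph breaks, then line breaks, then spaces),
# so it prefers to cut at the end of a paragraph or line rather than mid-sentence.
# It does not guarantee this: it only avoids mid-sentence cuts as far as chunk_size allows.
# We do not want to lose meaningful sentences by cutting them in half.

# ↓ There are two ways to use a splitter.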

In [20]:
docs = loader.load()

In [21]:
# Method 1
documents = splitter.split_documents(docs)  # -> List[Document]

print(len(documents))
documents

11


[Document(metadata={'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.docx'}, page_content="Part 1, Chapter 1\n\nPart One\n\n\n1\nIt was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.\n\nThe hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It depicted simply an enormous face, more than a metre wide: the face of a man of about forty-five, with a heavy black moustache and ruggedly handsome features. Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours. It was part of the economy drive in preparation for Hate Week. 

In [23]:
print(documents[0].page_content)

Part 1, Chapter 1

Part One


1
It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.

The hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It depicted simply an enormous face, more than a metre wide: the face of a man of about forty-five, with a heavy black moustache and ruggedly handsome features. Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours. It was part of the economy drive in preparation for Hate Week. The flat was seven flights up, and Winston, who was thirty-nine and had a varicose ulcer above his righ

In [None]:
# Using a splitter divides the document while preserving sentence and paragraph structure as far as possible.
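
The "recursive" idea can be sketched in a few lines: try the coarsest separator first, and fall back to a finer one only for pieces that are still too long. The function below is a simplified illustration written for this notebook, not LangChain's implementation (for instance, it ignores overlap); `recursive_split` and its parameters are assumed names.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Simplified sketch: split on coarse separators first, finer ones as needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Still fits: keep merging pieces into the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) <= chunk_size:
                current = piece
            else:
                # This piece alone is too long: retry with the next, finer separator.
                current = ""
                chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "Part One\n\n"
    "It was a bright cold day in April, and the clocks were striking thirteen.\n\n"
    "Winston made for the stairs."
)
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

Short paragraphs stay whole, and only the long middle paragraph is broken further, at spaces rather than in the middle of a word.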

### chunk_size=

In [None]:
# We sometimes need smaller Documents,
# for example when the model's context window is not large.
# Let's adjust this with chunk_size=.

In [24]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,  
)
# Method 2
documents = loader.load_and_split(text_splitter=splitter)
print(len(documents))

documents[:5]

3498


[Document(metadata={'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.docx'}, page_content='Part 1, Chapter 1\n\nPart One'),
 Document(metadata={'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.docx'}, page_content='1'),
 Document(metadata={'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.docx'}, page_content='It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors'),
 Document(metadata={'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.docx'}, page_content='was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of'),
 Document(metadata={'source': 'D:\\Lang2505\\dataset\\files\\chapter_one.docx'}, page_content='cold day in April, and the clocks were striking thirteen. Winston Smith, his chin

In [None]:
"""
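↑ As you can see, each Document chunk is now much smaller.

But look closely: there is a problem. Chunks are cut in the middle of a paragraph, and even mid-sentence.
Cutting like this is not very useful, because it breaks sentences apart (some chunks no longer make sense on their own).
Also note that chunk_overlap was not set, so its default (200) equals chunk_size here;
that is why consecutive chunks are nearly identical and why there are so many of them.

Can we keep chunks small without losing context at the cut points?
=> chunk_overlap=
    This parameter makes each chunk carry over part of the previous chunk when splitting.
    The tail of the previous chunk is placed at the start of the next one.
    As a result, neighbouring Documents may share an overlapping (duplicated) portion.
    The end of one Document becomes the start of the next.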
"""
None
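
The overlap idea can be shown with a character-level sliding window. This is a minimal sketch under simplifying assumptions: `split_with_overlap` is a hypothetical helper that cuts on raw character positions, whereas LangChain's splitters build the overlap from whole separator-delimited pieces.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not made only of already-seen text.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


text = "BIG BROTHER IS WATCHING YOU, the caption beneath it ran."
chunks = split_with_overlap(text, chunk_size=20, chunk_overlap=5)

# The tail of each chunk reappears at the head of the next one.
for previous, following in zip(chunks, chunks[1:]):
    print(repr(previous[-5:]), "->", repr(following[:5]))
```

A larger overlap preserves more context across cut points but also produces more chunks and more duplicated text, so the overlap should be noticeably smaller than `chunk_size`.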

### chunk_overlap=

In [25]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,  
    chunk_overlap=50,
)

documents = loader.load_and_split(text_splitter=splitter)
print(len(documents))

# Roughly check the overlapping parts
for document in documents[10:15]:
    print('🔷', document.page_content)

250
🔷 move. BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
🔷 Inside the flat a fruity voice was reading out a list of figures which had something to do with the production of pig-iron. The voice came from an oblong metal plaque like a dulled mirror which
🔷 an oblong metal plaque like a dulled mirror which formed part of the surface of the right-hand wall. Winston turned a switch and the voice sank somewhat, though the words were still distinguishable.
🔷 though the words were still distinguishable. The instrument (the telescreen, it was called) could be dimmed, but there was no way of shutting it off completely. He moved over to the window: a
🔷 it off completely. He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very


## CharacterTextSplitter

In [26]:
# v0.3
from langchain_text_splitters.character import CharacterTextSplitter
# https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html

# Splitting text that looks at characters.
# Create a new TextSplitter.

In [27]:
splitter = CharacterTextSplitter(
    separator='\n',  # split on line breaks
    chunk_size=600,   # aim for at most 600 characters per chunk (a single piece with no separator can exceed this)
    chunk_overlap=100,
)

documents = loader.load_and_split(text_splitter=splitter)
print(len(documents))

# Roughly check the overlapping parts
for document in documents[10:15]:
    print('🔷', document.page_content)

Created a chunk of size 963, which is longer than the specified 600
Created a chunk of size 774, which is longer than the specified 600
Created a chunk of size 954, which is longer than the specified 600
Created a chunk of size 922, which is longer than the specified 600
Created a chunk of size 881, which is longer than the specified 600
Created a chunk of size 821, which is longer than the specified 600
Created a chunk of size 700, which is longer than the specified 600
Created a chunk of size 745, which is longer than the specified 600
Created a chunk of size 735, which is longer than the specified 600
Created a chunk of size 671, which is longer than the specified 600
Created a chunk of size 991, which is longer than the specified 600
Created a chunk of size 990, which is longer than the specified 600
Created a chunk of size 1289, which is longer than the specified 600
Created a chunk of size 1605, which is longer than the specified 600
Created a chunk of size 1900, which is longer 

46
🔷 Winston turned round abruptly. He had set his features into the expression of quiet optimism which it was advisable to wear when facing the telescreen. He crossed the room into the tiny kitchen. By leaving the Ministry at this time of day he had sacrificed his lunch in the canteen, and he was aware that there was no food in the kitchen except a hunk of dark-coloured bread which had got to be saved for tomorrow's breakfast. He took down from the shelf a bottle of colourless liquid with a plain white label marked VICTORY GIN. It gave off a sickly, oily smell, as of Chinese ricespirit. Winston poured out nearly a teacupful, nerved himself for a shock, and gulped it down like a dose of medicine.
🔷 Instantly his face turned scarlet and the water ran out of his eyes. The stuff was like nitric acid, and moreover, in swallowing it one had the sensation of being hit on the back of the head with a rubber club. The next moment, however, the burning in his belly died down and the world began 
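
The "longer than the specified 600" warnings above appear because a splitter that cuts only on one separator cannot shorten a piece that contains no separator. The sketch below reproduces that behaviour in miniature; `split_on_separator` is a hypothetical helper written for this example (overlap is ignored), not the library's code.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Merge separator-delimited pieces into chunks of at most chunk_size where possible."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # An over-long first piece is accepted as-is: there is nowhere to cut it.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "short line\n" + "x" * 30 + "\nend"
chunks = split_on_separator(text, separator="\n", chunk_size=15)
oversized = [chunk for chunk in chunks if len(chunk) > 15]
print(len(chunks), [len(chunk) for chunk in oversized])
```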

# TikToken

### length_function=

In [28]:
"""
By default, every splitter measures the length of the text
to determine the size of one chunk.
For this it uses Python's built-in len() function. (default)

You can also supply your own length function to a splitter,
  via the length_function= parameter.
"""
None

In [None]:
splitter = CharacterTextSplitter(
    separator='\n',  
    chunk_size=600,   
    chunk_overlap=100,
    length_function=len,
)


In [None]:
# By default len() is used, so CharacterTextSplitter measures chunk size in characters.
# However, a token in LLM terms is not the same thing as a character (letter).
# In some cases two, three, or more characters are counted as a single token.
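
The effect of swapping the length function can be seen with a toy example. Both `toy_token_count` (which counts words and punctuation marks) and `pack_words` are hypothetical helpers invented for this sketch; a real tokenizer such as tiktoken produces different counts.

```python
import re
from typing import Callable


def toy_token_count(text: str) -> int:
    """Hypothetical stand-in for a tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def pack_words(
    text: str,
    chunk_size: int,
    length_function: Callable[[str], int] = len,
) -> list[str]:
    """Greedily pack words into chunks whose measured length stays within chunk_size."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else f"{current} {word}"
        if current and length_function(candidate) > chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "It was a bright cold day in April, and the clocks were striking thirteen."

# Same chunk_size, different unit: 20 characters versus 20 toy tokens.
by_chars = pack_words(text, chunk_size=20)
by_tokens = pack_words(text, chunk_size=20, length_function=toy_token_count)
print(len(by_chars), len(by_tokens))
```

The same `chunk_size` value means very different chunk lengths depending on the unit, which is why the length function should match how the model itself counts text.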

### Note: OpenAI Tokenizer example

In [None]:
"""
Token examples from OpenAI
https://platform.openai.com/tokenizer
↓ Here you can check how many tokens a text uses from the model's point of view.
"""
None

In [29]:
# We can use the OpenAI models' tokenizer in our splitter!

## from_tiktoken_encoder()

In [30]:
# tiktoken was created by OpenAI.
# https://github.com/openai/tiktoken   <- this link appears at the bottom of the Tokenizer page above.

# Using from_tiktoken_encoder() below means the tiktoken package does the counting.

splitter = CharacterTextSplitter.from_tiktoken_encoder(
    separator='\n',  
    chunk_size=600,   
    chunk_overlap=100,    
)

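# Now the splitter counts text in tokens, the same unit the model uses.
# (To match a specific model's tokenizer, pass model_name= or encoding_name=; the default encoding is 'gpt2'.)
# A model has an input limit, so we cannot feed it all the text we want at once.
# It is therefore better to measure text length the same way the model does.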


