# Exploring RAG in LangChain

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/12-RAG/03-RAG-Advanced.ipynb)[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/12-RAG/03-RAG-Advanced.ipynb)
![rag-1.png](./assets/12-rag-rag-basic-pdf-rag-process-01.png)

![rag-2.png](./assets/12-rag-rag-basic-pdf-rag-process-02.png)

## Overview

This tutorial walks through the full process of indexing, retrieval, and generation using LangChain's RAG framework. It gives a broad overview of a typical RAG application pipeline and shows how to retrieve relevant content and generate responses with LangChain's key building blocks (document loaders, vector stores, embeddings, retrievers, and generators), arranged in a modular design.

### 1. Question Processing

The question processing stage involves receiving a user's question, handling it, and finding relevant data. The following components are required for this process:

- **Data Source Connection**
To find answers to the question, it is necessary to connect to various text data sources. LangChain helps you easily establish connections to various data sources.
- **Data Indexing and Retrieval**
To efficiently find relevant information from data sources, the data must be indexed. LangChain automates the indexing process and provides tools to retrieve data related to the user's question.


### 2. Answer Generation

Once the relevant data is found, the next step is to generate an answer based on it. The following components are essential for this stage:

- **Answer Generation Model**
LangChain integrates with language models (chat models and LLMs) to generate answers from the retrieved data. The model takes the user's question and the retrieved data as input and produces an answer grounded in that data.


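## Architecture

This tutorial builds a typical RAG application as outlined in the [Q&A introduction](https://python.langchain.com/docs/tutorials/). It has two main components:

- **Indexing** : A pipeline that collects data from a source and builds an index. _This usually happens offline._

- **Retrieval and Generation** : The actual RAG chain, which handles user queries at run time, retrieves relevant data from the index, and passes it to the model.

The entire workflow from raw data to a generated answer is as follows: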

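### Indexing

![](https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png)

- Indexing image source: https://python.langchain.com/docs/tutorials/rag/

1. **Load** : The first step is to load the data. For this, we use [document loaders](https://python.langchain.com/docs/integrations/document_loaders/).

2. **Split** : [Text splitters](https://python.langchain.com/docs/concepts/text_splitters/) break large ```Documents``` into smaller chunks.
This is useful both for indexing the data and for passing it to a model, because large chunks are harder to retrieve and may not fit in the model's limited context window.

3. **Store** : The splits need to be stored and indexed somewhere for later retrieval. This is usually done with a [vector store](https://python.langchain.com/docs/concepts/vectorstores/) and an [embedding](https://python.langchain.com/docs/integrations/text_embedding/) model.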

### Retrieval and Generation

![](https://python.langchain.com/assets/images/rag_retrieval_generation-1046a4668d6bb08786ef73c56d4f228a.png)

- Retrieval and generation image source: https://python.langchain.com/docs/tutorials/rag/

1. **Retrieve** : Given a user input, a [retriever](https://python.langchain.com/docs/integrations/retrievers/) fetches the relevant chunks from the data store.

2. **Generate** : A [chat model](https://python.langchain.com/docs/integrations/chat/) / [LLM](https://python.langchain.com/docs/integrations/llms/) generates an answer from a prompt that contains both the question and the retrieved data.

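**Architecture in detail:**

### Core Advantages of a RAG System

**1. Updatable knowledge**
- The knowledge base can be updated without retraining the model
- New documents and information can be added at any time
- Works around the model's knowledge cutoff date

**2. Traceability and explainability**
- Answers can be traced back to the specific source documents they draw on
- Users can verify the accuracy and reliability of an answer
- Makes the reasoning process more transparent

**3. Domain specialization**
- Can be tuned for a specific domain or for enterprise knowledge
- Supports private data and specialized content
- Helps keep answers accurate within the domain

### Implementation Details

**Key considerations for the indexing phase:**
- **Data quality**: clean and preprocess the raw documents
- **Chunking strategy**: balance information completeness against retrieval efficiency
- **Vectorization**: choose a suitable embedding model and parameters

**Optimizations for the retrieval phase:**
- **Similarity search**: find the most relevant content based on semantic similarity
- **Reranking**: further refine the relevance of the retrieved results
- **Context management**: control how much information is passed to the generation model

This architecture lets a RAG system combine the generative ability of large language models with accurate, reliable domain knowledge.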
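
To make the two phases concrete, here is a minimal, dependency-free sketch. It is not LangChain code: the word-count "embedding", the `index_chunks`, `retrieve`, and `build_prompt` helpers, and the sample chunks are all assumptions made for illustration. A real pipeline would use an embedding model and a vector store, as the rest of this tutorial does.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy "embedding": a bag-of-words count vector (an assumption for
    # illustration; real systems use a learned embedding model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def index_chunks(chunks):
    # Indexing phase: store each chunk next to its vector.
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=1):
    # Retrieval phase: rank stored chunks by similarity to the question.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    # Generation phase input: the prompt a chat model would receive.
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "FAISS is a library for similarity search over dense vectors.",
    "A text splitter breaks long documents into smaller chunks.",
    "The retriever returns the chunks most relevant to a question.",
]
index = index_chunks(chunks)
top = retrieve(index, "What does a text splitter do?", k=1)
prompt = build_prompt("What does a text splitter do?", top)
print(prompt)
```

The printed prompt contains only the chunk about text splitters, which is the piece of context a model would then use to write its answer.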

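## Document Used for Practice

A European Approach to Artificial Intelligence - A Policy Perspective

- Author: Digital Enlightenment Forum under the guidance of EIT Digital, supported by contributions from EIT Manufacturing, EIT Urban Mobility, EIT Health, and EIT Climate-KIC
- Link: https://eit.europa.eu/news-events/news/european-approach-artificial-intelligence-policy-perspective
- File name: **A European Approach to Artificial Intelligence - A Policy Perspective.pdf**

_Copy the downloaded file into the **data** folder for practice._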

### Table of Contents

- [Overview](#overview)
- [Document Used for Practice](#document-used-for-practice)
- [Environment Setup](#environment-setup)
- [Explore Each Module](#explore-each-module)
- [Step 1: Load Document](#step-1-load-document)
- [Step 2: Split Documents](#step-2-split-documents)
- [Step 3: Embedding](#step-3-embedding)
- [Step 4: Create Vectorstore](#step-4-create-vectorstore)
- [Step 5: Create Retriever](#step-5-create-retriever)
- [Step 6: Create Prompt](#step-6-create-prompt)
- [Step 7: Create LLM](#step-7-create-llm)


### References

- [LangChain: Document Loaders](https://python.langchain.com/docs/integrations/document_loaders/)
- [LangChain: Text splitters](https://python.langchain.com/docs/concepts/text_splitters/)
- [LangChain: Vector Store](https://python.langchain.com/docs/concepts/vectorstores/)
- [LangChain: Embeddings](https://python.langchain.com/docs/integrations/text_embedding/)
- [LangChain: Retriever](https://python.langchain.com/docs/integrations/retrievers/)
- [LangChain: Chat Models](https://python.langchain.com/docs/integrations/chat/)
- [LangChain: LLM](https://python.langchain.com/docs/integrations/llms/)
- [LangChain: Indexing](https://python.langchain.com/docs/tutorials/rag/)
- [LangChain: Retrieval and Generation](https://python.langchain.com/docs/tutorials/rag/)
- [Semantic Similarity Splitter](https://python.langchain.com/api_reference/experimental/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html)
- [OpenAI API Model List / Pricing](https://openai.com/api/pricing/)
- [HuggingFace LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
---

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- You can check out [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "bs4",
        "faiss-cpu",
        "pypdf",
        "pypdf2",
        "unstructured",
        "unstructured[pdf]",
        "fastembed",
        "chromadb",
        "rank_bm25",
        "langsmith",
        "langchain",
        "langchain_text_splitters",
        "langchain_community",
        "langchain_core",
        "langchain_openai",
        "langchain_experimental",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "HUGGINGFACEHUB_API_TOKEN": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "03-RAG-Advanced",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys, such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Explore Each Module
The following are the modules used in this content.

In [5]:
import bs4
from langchain import hub
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

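The following example processes a web page with a basic RAG pipeline, using ```WebBaseLoader```.

At each step you can configure various options or apply new techniques.
If a warning appears because ```USER_AGENT``` is not set when using ```WebBaseLoader```,

add ```USER_AGENT = myagent``` to your ```.env``` file.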

In [6]:
# Step 1: Load Documents
# Load the contents of news articles, split them into chunks, and index them.
url = "https://www.forbes.com/sites/rashishrivastava/2024/05/21/the-prompt-scarlett-johansson-vs-openai/"
loader = WebBaseLoader(
    web_paths=(url,),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["article-body fs-article fs-premium fs-responsive-text current-article font-body color-body bg-base font-accent article-subtype__masthead",
                             "header-content-container masthead-header__container"]},
        )
    ),
)
docs = loader.load()


# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

splits = text_splitter.split_documents(docs)

# Step 3: Embedding & Create Vectorstore
vectorstore = FAISS.from_documents(documents=splits, embedding=OpenAIEmbeddings(model="text-embedding-3-small"))

# Step 4: retriever
# Retrieve and generate information contained in the news.
retriever = vectorstore.as_retriever()

# Step 5: Create Prompt
prompt = hub.pull("rlm/rag-prompt")

# Step 6: Create LLM
# Generate the language model (LLM).
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)


def format_docs(docs):
    # Combine the retrieved document results into a single paragraph.
    return "\n\n".join(doc.page_content for doc in docs)


# Step 7: Create Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Step 8: Run Chain
# Input queries about the documents and output answers.
question = "Why did OpenAI and Scarlett Johansson have a conflict?"
response = rag_chain.invoke(question)

# output the results.
print(f"URL: {url}")
print(f"Number of documents: {len(docs)}")
print("===" * 20)
print(f"[HUMAN]\n{question}\n")
print(f"[AI]\n{response}")

URL: https://www.forbes.com/sites/rashishrivastava/2024/05/21/the-prompt-scarlett-johansson-vs-openai/
Number of documents: 1
[HUMAN]
Why did OpenAI and Scarlett Johansson have a conflict?

[AI]
Scarlett Johansson and OpenAI had a conflict over a voice for ChatGPT that sounded similar to her own, which she claimed was created without her consent. After declining an offer to voice the AI, Johansson expressed shock and anger when the voice was used in a demo shortly thereafter. Her lawyers demanded details on the voice's creation and requested its removal, while OpenAI stated it was not an imitation of her voice.


In [7]:
print(docs)

[Document(metadata={'source': 'https://www.forbes.com/sites/rashishrivastava/2024/05/21/the-prompt-scarlett-johansson-vs-openai/'}, page_content="ForbesInnovationEditors' PickThe Prompt: Scarlett Johansson Vs OpenAIPlus AI-generated kids draw predators on TikTok and Instagram. \nShare to FacebookShare to TwitterShare to Linkedin“I was shocked, angered and in disbelief,” Scarlett Johansson said about OpenAI's Sky voice for ChatGPT that sounds similar to her own.FilmMagic\nThe Prompt is a weekly rundown of AI’s buzziest startups, biggest breakthroughs, and business deals. To get it in your inbox, subscribe here.\n\n\nWelcome back to The Prompt.\n\nScarlett Johansson’s lawyers have demanded that OpenAI take down a voice for ChatGPT that sounds much like her own after she’d declined to work with the company to create it. The actress said in a statement provided to Forbes that her lawyers have asked the AI company to detail the “exact processes” it used to create the voice, which sounds eer

## Step 1: Load Document

- [Link to official documentation - Document loaders](https://python.langchain.com/docs/integrations/document_loaders/)


### Web Page

```WebBaseLoader``` uses ```bs4.SoupStrainer``` to parse only the necessary parts of the specified web page.

[Note]

- ```bs4.SoupStrainer``` makes it convenient to extract the elements you need from a web page.

(Example)

```python
bs4.SoupStrainer(
    "div",
    attrs={"class": ["newsct_article _article_body", "media_end_head_title"]},  # Enter the class names.
)

bs4.SoupStrainer(
    "article",
    attrs={"id": ["dic_area"]},  # Enter the id values.
)
```

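**Details:**

### Core Features of bs4.SoupStrainer

**1. Selective parsing**
- Parses only specific HTML elements in the page
- Can speed up parsing, since the rest of the page is skipped
- Reduces memory usage

**2. Precise targeting**
- **Tag selection**: specify the HTML tag type (such as `div`, `article`, `p`)
- **Attribute filtering**: filter on attributes such as class and id
- **Content filtering**: exclude unrelated page content

**3. Common usage patterns**

```python
# Filter by class attribute
bs4.SoupStrainer(
    "div",
    attrs={"class": "content-body"}
)

# Filter by id attribute
bs4.SoupStrainer(
    "section",
    attrs={"id": "main-content"}
)

# Multiple class values
bs4.SoupStrainer(
    "p",
    attrs={"class": ["paragraph", "text-content"]}
)
```

**4. Practical use cases**
- **News sites**: extract the article body, excluding ads and sidebars
- **Blogs**: get the post content, ignoring comments and navigation
- **Academic sites**: extract a paper's abstract or main content

**5. Benefits in a RAG system**
- **Higher quality**: only relevant content is extracted, which reduces noise
- **Lower resource use**: less to process and store
- **Better retrieval**: cleaner content tends to improve retrieval accuracy

Precise content extraction of this kind is important for building a high-quality RAG knowledge base.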
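
The effect of selective parsing can be illustrated without any third-party package. The sketch below is a stand-in that uses only the standard library's `html.parser`, not `bs4.SoupStrainer` itself. The `ClassTextExtractor` class and the sample HTML are hypothetical and exist only to show what "keep the text of matching elements and drop the rest" means.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect text only from <tag> elements that carry a wanted class."""

    def __init__(self, tag, wanted_classes):
        super().__init__()
        self.tag = tag
        self.wanted = set(wanted_classes)
        self.depth = 0  # > 0 while inside a matching element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.depth or classes & self.wanted:
            # Count nested same-name tags so the matching close tag is found.
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside a matching element.
        if self.depth and data.strip():
            self.parts.append(data.strip())


html = """
<html><body>
  <div class="sidebar">Ads and navigation</div>
  <div class="content-body"><p>Main article text.</p><div>Nested note.</div></div>
  <div class="footer">Copyright</div>
</body></html>
"""

extractor = ClassTextExtractor("div", ["content-body"])
extractor.feed(html)
article_text = " ".join(extractor.parts)
print(article_text)
```

Only the text inside the `content-body` element survives. The sidebar and footer never reach the document that would later be split and embedded.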

Here is another example, a BBC news article. Try running it!

In [8]:
# Load the contents of the news article, split it into chunks, and index it.
loader = WebBaseLoader(
    web_paths=("https://www.bbc.com/news/business-68092814",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "main",
            attrs={"id": ["main-content"]},
        )
    ),
)
docs = loader.load()
print(f"Number of documents: {len(docs)}")
docs[0].page_content[:500]

Number of documents: 1


'Could AI \'trading bots\' transform the world of investing?Getty ImagesIt is hard for both humans and computers to predict stock market movementsSearch for "AI investing" online, and you\'ll be flooded with endless offers to let artificial intelligence manage your money.I recently spent half an hour finding out what so-called AI "trading bots" could apparently do with my investments.Many prominently suggest that they can give me lucrative returns. Yet as every reputable financial firm warns - your '

### PDF
The following section covers the document loader for importing **PDF** files.

In [9]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF file. Enter the file path.
loader = PyPDFLoader("data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf")



docs = loader.load()
print(f"Number of documents: {len(docs)}")

# Output the content of the 10th page.
print(f"\n[page_content]\n{docs[9].page_content[:500]}")
print(f"\n[metadata]\n{docs[9].metadata}\n")

Number of documents: 24

[page_content]
A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE
10
requirements becomes mandatory in all sectors and create bar -
riers especially for innovators and SMEs. Public procurement ‘data 
sovereignty clauses’ induce large players to withdraw from AI for 
urban ecosystems. Strict liability sanctions block AI in healthcare, 
while limiting space of self-driving experimentation. The support 
measures to boost European AI are not sufficient to offset the 
unintended effect of generic

[metadata]
{'source': 'data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'page': 9}



### CSV
The following section covers the document loader for importing CSV files.

CSV retrieves data using row numbers instead of page numbers.

In [10]:
from langchain_community.document_loaders.csv_loader import CSVLoader

# Load CSV file
loader = CSVLoader(file_path="data/titanic.csv")
docs = loader.load()
print(f"Number of documents: {len(docs)}")

# Output the content of the 10th row.
print(f"\n[row_content]\n{docs[9].page_content[:500]}")
print(f"\n[metadata]\n{docs[9].metadata}\n")

Number of documents: 20

[row_content]
PassengerId: 10
Survived: 1
Pclass: 2
Name: Nasser, Mrs. Nicholas (Adele Achem)
Sex: female
Age: 14
SibSp: 1
Parch: 0
Ticket: 237736
Fare: 30.0708
Cabin: 
Embarked: C

[metadata]
{'source': 'data/titanic.csv', 'row': 9}



### TXT
The following section covers the document loader for importing TXT files.

In [11]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/appendix-keywords_eng.txt", encoding="utf-8")
docs = loader.load()
print(f"Number of documents: {len(docs)}")

# Output the beginning of the first (and only) document.
print(f"\n[page_content]\n{docs[0].page_content[:500]}")
print(f"\n[metadata]\n{docs[0].metadata}\n")

Number of documents: 1

[page_content]
- Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching to understand the meaning of the user’s query and return relevant results.
Example: When a user searches for "planets in the solar system," it returns information about related planets such as "Jupiter" or "Mars."
Keywords: Natural Language Processing, Search Algorithm, Data Mining

- Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, int

[metadata]
{'source': 'data/appendix-keywords_eng.txt'}



### Load all files in the folder

Here is an example of loading all ```.txt``` files in the folder.


In [12]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(".", glob="data/*.txt", show_progress=True)
docs = loader.load()

print(f"Number of documents: {len(docs)}")

# Output the beginning of the first document and the metadata of both documents.
print(f"\n[page_content]\n{docs[0].page_content[:500]}")
print(f"\n[metadata]\n{docs[0].metadata}\n")
print(f"\n[metadata]\n{docs[1].metadata}\n")

100%|██████████| 2/2 [00:08<00:00,  4.26s/it]

Number of documents: 2

[page_content]
Selecting the “right” amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries generat

[metadata]
{'source': 'data/chain-of-density.txt'}


[metadata]
{'source': 'data/appendix-keywords_eng.txt'}






The following is an example of loading all ```.pdf``` files in the folder.

In [13]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(".", glob="data/*.pdf")
docs = loader.load()

print(f"page_content: {len(docs)}\n")
print("[metadata]\n")
print(docs[0].metadata)
print("\n========= [Preview] Front Section =========\n")
print(docs[0].page_content[2500:3000])

page_content: 1

[metadata]

{'source': 'data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf'}


While a clear cut definition of Artificial Intelligence (AI) would be the building block for its regulatory and governance framework, there is not yet a widely accepted definition of what AI is (Buiten, 2019; Scherer, 2016). Definitions focussing on intelligence are often circular in that defining what level of intelligence is nee- ded to qualify as ‘artificial intelligence’ remains subjective and situational1. Pragmatic ostensive definitions simply group under the AI labels a wide array of tech


### Python

The following is an example of loading ```.py``` files.

In [14]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PythonLoader

loader = DirectoryLoader(".", glob="**/*.py", loader_cls=PythonLoader)
docs = loader.load()

print(f"page_content: {len(docs)}\n")
print("[metadata]\n")
print(docs[0].metadata)
print("\n========= [Preview] Front Section =========\n")
print(docs[0].page_content[:500])

page_content: 1

[metadata]

{'source': 'data/audio_utils.py'}


import re
import os
from pytube import YouTube
from moviepy.editor import AudioFileClip, VideoFileClip
from pydub import AudioSegment
from pydub.silence import detect_nonsilent


def extract_abr(abr):
    youtube_audio_pattern = re.compile(r"\d+")
    kbps = youtube_audio_pattern.search(abr)
    if kbps:
        kbps = kbps.group()
        return int(kbps)
    else:
        return 0


def get_audio_filepath(filename):
    # Create the audio folder if it doesn't exist
    if not os.path.isdir("au


---


## Step 2: Split Documents

This step splits the loaded documents into small chunks.

In [15]:
# Load the content of the news article, split it into chunks, and index it.
loader = WebBaseLoader(
    web_paths=("https://www.bbc.com/news/business-68092814",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "main",
            attrs={"id": ["main-content"]},
        )
    ),
)
docs = loader.load()
print(f"Number of Documents: {len(docs)}")
docs[0].page_content[:500]

Number of Documents: 1


'Could AI \'trading bots\' transform the world of investing?Getty ImagesIt is hard for both humans and computers to predict stock market movementsSearch for "AI investing" online, and you\'ll be flooded with endless offers to let artificial intelligence manage your money.I recently spent half an hour finding out what so-called AI "trading bots" could apparently do with my investments.Many prominently suggest that they can give me lucrative returns. Yet as every reputable financial firm warns - your '

### CharacterTextSplitter

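This is the simplest method. It splits text on a separator string (default: "\n\n") and measures chunk size by the number of characters.

1. **How the text is split**: by a single separator string.
2. **How chunk size is measured**: by the ```len``` of the characters.

Visualization example: https://chunkviz.up.railway.app/

The ```CharacterTextSplitter``` class splits text into chunks of a specified size.

- ```separator``` specifies the string used to separate chunks; here it is two newline characters ("\n\n").
- ```chunk_size``` sets the maximum length of each chunk.
- ```chunk_overlap``` specifies the number of characters that adjacent chunks share.
- ```length_function``` defines the function used to compute chunk length; the default is ```len```, which returns the length of the string.
- ```is_separator_regex``` is a boolean that determines whether ```separator``` is interpreted as a regular expression.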

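**Details:**

### Core Characteristics of CharacterTextSplitter

**1. Simple, intuitive splitting logic**
- Splits on the specified separator
- Ignores semantic boundaries and works purely on character counts
- Suits structured text or content with a uniform format

**2. Parameter configuration in practice**

```python
# Basic configuration
splitter = CharacterTextSplitter(
    separator="\n\n",      # Paragraph separator
    chunk_size=1000,       # At most 1000 characters per chunk
    chunk_overlap=200,     # Adjacent chunks overlap by 200 characters
    length_function=len,   # Measure length in characters
    is_separator_regex=False
)
```

**3. Why overlap matters**
- **Context preservation**: important information is less likely to be lost at a chunk boundary
- **Semantic continuity**: keeps neighboring text connected
- **Retrieval**: gives related content more chances to be retrieved

**4. Where it fits**
- **Structured documents**: legal documents, technical manuals
- **Uniformly formatted content**: news articles, blog posts
- **Rapid prototyping**: early development and testing of a RAG system

**5. Limitations and considerations**
- **Semantic completeness**: may cut in the middle of a sentence
- **Language differences**: results vary from language to language
- **Document structure**: cannot recognize logical structure such as headings and lists

**6. Best practices**
- Adjust the separator to the content type
- Set a suitable overlap ratio (commonly 10-20%)
- Take the target model's context window size into account

Although basic, this simple splitting method is still effective and practical in many real applications.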
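
The interaction between `separator`, `chunk_size`, and `chunk_overlap` can be sketched in plain Python. This is a simplified illustration and not LangChain's implementation: the function name `split_by_separator` is hypothetical, and details such as whitespace stripping and warnings are left out.

```python
def split_by_separator(text, separator="\n\n", chunk_size=100, chunk_overlap=10):
    """Simplified single-separator splitter (not LangChain's implementation)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The next piece does not fit: emit what has been collected so far.
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit in the overlap budget
            # and still leave room for the next piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


words = "aaaa bbbb cccc dddd"
with_overlap = split_by_separator(words, separator=" ", chunk_size=9, chunk_overlap=4)
no_overlap = split_by_separator(words, separator=" ", chunk_size=9, chunk_overlap=0)
one_piece = split_by_separator("x" * 50, separator="\n\n", chunk_size=10, chunk_overlap=0)

print(with_overlap)  # ['aaaa bbbb', 'bbbb cccc', 'cccc dddd']
print(no_overlap)    # ['aaaa bbbb', 'cccc dddd']
print(len(one_piece), len(one_piece[0]))  # 1 50
```

The last call shows the main limitation of a single separator: when the separator never occurs in the text, the whole text stays in one chunk that is larger than `chunk_size`.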

In [16]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=100,
    chunk_overlap=10,
    length_function=len,
    is_separator_regex=False,
)

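This code uses the ```create_documents``` method of the ```text_splitter``` object to split the given text (```state_of_the_union```) into multiple documents and stores the result in the ```texts``` variable. It then prints the first document in texts. This can be seen as an initial step in processing and analyzing text data, and it is especially useful for breaking large text into manageable chunks.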

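**Details:**

### How the create_documents Method Works

**1. Document splitting flow**
```python
# Basic usage pattern
# (state_of_the_union is assumed to be a string holding the full text)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])  # Show the first split document
```

**2. Input and output formats**
- **Input**: a list of text strings, `[state_of_the_union]`
- **Output**: a list of `Document` objects, each containing:
  - `page_content`: the split text content
  - `metadata`: related metadata (such as source and position)

**3. Practical value**

**Text preprocessing**
- Converts long raw text into small chunks that are easy to process
- Each chunk is an independent `Document` object
- Simplifies the later vectorization and indexing steps

**Memory management**
- Avoids processing a very large text in one piece
- Allows chunk-by-chunk processing
- Reduces the amount of text held and processed at once

**RAG system integration**
- Prepares suitably sized text chunks for the vector database
- Helps each chunk express a complete concept or topic
- Improves retrieval precision and relevance

**4. Typical processing flow**
```python
# 1. Split the document
# (original_text is assumed to be a string holding the source text)
texts = text_splitter.create_documents([original_text])

# 2. Inspect the result
print(f"Split into {len(texts)} chunks in total")
print(f"Length of the first chunk: {len(texts[0].page_content)}")

# 3. Further processing
for i, doc in enumerate(texts):
    print(f"Document {i+1}: {doc.page_content[:100]}...")
```

**5. Why it matters in text analysis**
- **Manageability**: breaks large documents into units that can be processed
- **Parallel processing**: independent chunks can be processed across multiple cores or machines
- **Quality control**: makes it easy to inspect and validate the split results

This splitting step is a foundation for RAG systems and other text analysis applications, and the quality of later steps depends on it.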
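
The input and output shapes described above can be mimicked with a small stand-in. In this sketch, `Doc`, `fixed_width_split`, and `create_docs` are hypothetical names and are not LangChain classes or functions. The sketch only illustrates the idea that every chunk becomes its own document and carries a copy of the metadata of the text it came from.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical stand-in for LangChain's Document class.
    page_content: str
    metadata: dict = field(default_factory=dict)


def fixed_width_split(text, width=20):
    # Deliberately naive splitter, used only to keep the sketch short.
    return [text[i:i + width] for i in range(0, len(text), width)]


def create_docs(texts, metadatas=None, split_fn=fixed_width_split):
    # One metadata dict per input text; default to empty dicts.
    metadatas = metadatas or [{} for _ in texts]
    docs = []
    for text, meta in zip(texts, metadatas):
        for chunk in split_fn(text):
            # Each chunk gets its own copy of the source metadata.
            docs.append(Doc(page_content=chunk, metadata=dict(meta)))
    return docs


source_text = "RAG combines retrieval with generation."
docs_out = create_docs([source_text], metadatas=[{"source": "notes.txt"}])

for d in docs_out:
    print(repr(d.page_content), d.metadata)
```

Copying the metadata per chunk means a later step can annotate one chunk, for example with a position, without affecting its siblings.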

In [17]:
# Load a portion of the "Chain of Density" paper.
with open("data/chain-of-density.txt", "r", encoding="utf-8") as f:
    text = f.read()[:500]

In [18]:
text_splitter = CharacterTextSplitter(
    chunk_size=100, chunk_overlap=10, separator="\n\n"
)
text_splitter.split_text(text)

['Selecting the “right” amount of information to include in a summary is a difficult task. \nA good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries genera']

In [19]:
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=10, separator="\n")
text_splitter.split_text(text)

['Selecting the “right” amount of information to include in a summary is a difficult task.',
 'A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries genera']

In [20]:
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=10, separator=" ")
text_splitter.split_text(text)

['Selecting the “right” amount of information to include in a summary is a difficult task. \nA good',
 'A good summary should be detailed and entity-centric without being overly dense and hard to follow.',
 'to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with',
 'with what we refer to as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an initial',
 'an initial entity-sparse summary before iteratively incorporating missing salient entities without',
 'without increasing the length. Summaries genera']

In [21]:
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0, separator=" ")
text_splitter.split_text(text)

['Selecting the “right” amount of information to include in a summary is a difficult task. \nA good',
 'summary should be detailed and entity-centric without being overly dense and hard to follow. To',
 'better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to',
 'as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary',
 'before iteratively incorporating missing salient entities without increasing the length. Summaries',
 'genera']

In [22]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100, separator=" ")
# Split the text file into chunks.
text_splitter.split_text(text)

# Split the document into chunks.
split_docs = text_splitter.split_documents(docs)
len(split_docs)

8

In [23]:
split_docs[0]

Document(metadata={'source': 'https://www.bbc.com/news/business-68092814'}, page_content='Could AI \'trading bots\' transform the world of investing?Getty ImagesIt is hard for both humans and computers to predict stock market movementsSearch for "AI investing" online, and you\'ll be flooded with endless offers to let artificial intelligence manage your money.I recently spent half an hour finding out what so-called AI "trading bots" could apparently do with my investments.Many prominently suggest that they can give me lucrative returns. Yet as every reputable financial firm warns - your capital may be at risk.Or putting it more simply - you could lose your money - whether it is a human or a computer that is making stock market decisions on your behalf.Yet such has been the hype about the ability of AI over the past few years, that almost one in three investors would be happy to let a trading bot make all the decisions for them, according to one 2023 survey in the US.John Allan says inve

In [24]:
# Load the content of the news article, split it into chunks, and index it.
loader = WebBaseLoader(
    web_paths=("https://www.bbc.com/news/business-68092814",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "main",
            attrs={"id": ["main-content"]},
        )
    ),
)

# Define the splitter.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100, separator=" ")

# Split the document while loading it.
split_docs = loader.load_and_split(text_splitter=text_splitter)
print(f"Number of documents: {len(docs)}")
docs[0].page_content[:500]

Number of documents: 1


'Could AI \'trading bots\' transform the world of investing?Getty ImagesIt is hard for both humans and computers to predict stock market movementsSearch for "AI investing" online, and you\'ll be flooded with endless offers to let artificial intelligence manage your money.I recently spent half an hour finding out what so-called AI "trading bots" could apparently do with my investments.Many prominently suggest that they can give me lucrative returns. Yet as every reputable financial firm warns - your '

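### RecursiveCharacterTextSplitter
This text splitter is the recommended choice for general text.

1. **How the text is split**: by a list of separators.
2. **How chunk size is measured**: by the ```len``` of the characters.

The ```RecursiveCharacterTextSplitter``` class splits text recursively. It accepts parameters such as ```chunk_size``` to set the size of the chunks, ```chunk_overlap``` to define the overlap between adjacent chunks, ```length_function``` to compute chunk length, and ```is_separator_regex``` to indicate whether the separators are regular expressions.

In the example, the chunk size is set to 100, the overlap to 20, the length function to ```len```, and ```is_separator_regex``` to ```False``` to indicate that the separators are not regular expressions.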

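**Details:**

### Advantages of RecursiveCharacterTextSplitter

**1. Smarter splitting strategy**
- Splits using a **prioritized list of separators**
- Default separator order: `["\n\n", "\n", " ", ""]`
- Prefers to keep paragraphs and sentences intact

**2. Recursive splitting mechanism**
```python
# Splitting logic, in order:
# 1. Try to split on "\n\n" (paragraph breaks)
# 2. If a chunk is still too large, split further on "\n" (line breaks)
# 3. If it is still too large, split on " " (spaces)
# 4. As a last resort, split on "" (individual characters)
```

**3. Preserving meaning**
- **Paragraphs**: first tries to keep paragraphs from being cut
- **Sentences**: avoids splitting in the middle of a sentence where possible
- **Words**: falls back to character-level splitting only at the end

**4. Configuration parameters**
```python
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,        # Target chunk size
    chunk_overlap=20,      # Overlap in characters (20% of chunk_size)
    length_function=len,   # How length is measured
    is_separator_regex=False,  # Separators are not regular expressions
    separators=["\n\n", "\n", " ", ""]  # Separator priority
)
```

**5. Comparison with CharacterTextSplitter**
| Feature | RecursiveCharacterTextSplitter | CharacterTextSplitter |
|------|-------------------------------|----------------------|
| Splitting strategy | Multi-level separators | Single separator |
| Preserving meaning | Good (prefers whole paragraphs and sentences) | Fair (may cut sentences) |
| Best suited for | General text, natural language | Structured text |
| Complexity | Moderate | Simple |

**6. Recommended practices**
- **chunk_size**: set according to the model's context window (commonly 500-2000 characters)
- **chunk_overlap**: set to 10-20% of chunk_size
- **Custom separators**: adjust the separator priority to the document type

**7. Practical use cases**
- **Technical documentation**: keeps code blocks and their explanations together
- **Academic papers**: maintains the logical structure of paragraphs and arguments
- **Fiction**: keeps dialogue and description coherent
- **News articles**: keeps paragraphs and quotations intact

Recursive splitting gives better results in most text processing scenarios and is the recommended choice for RAG systems.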
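
The fallback through the separator list can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's implementation: the function name `recursive_split` is hypothetical, and `chunk_overlap` is left out on purpose to keep the recursion easy to follow.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitter without overlap (illustration only)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # Nothing left to try; keep the oversized piece.
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        # This separator does not help; fall through to the next one.
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Piece is still too big: flush, then retry with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # Keep merging small pieces together.
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "para one is short.\n\npara two has a few more words in it."
result = recursive_split(sample, chunk_size=20)
hard_cut = recursive_split("abcdefghij", chunk_size=4)

print(result)    # ['para one is short.', 'para two has a few', 'more words in it.']
print(hard_cut)  # ['abcd', 'efgh', 'ij']
```

The first paragraph fits and is kept whole. Only the second, longer paragraph is broken down further, and it is broken on spaces rather than in the middle of a word.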

In [25]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
recursive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=10,
    length_function=len,
    is_separator_regex=False,
)

In [26]:
# Load a portion of the "Chain of Density" paper.
with open("data/chain-of-density.txt", "r", encoding="utf-8") as f:
    text = f.read()[:500]

In [27]:
character_text_splitter = CharacterTextSplitter(
    chunk_size=100, chunk_overlap=10, separator=" "
)
for sent in character_text_splitter.split_text(text):
    print(sent)
print("===" * 20)
recursive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100, chunk_overlap=10
)
for sent in recursive_text_splitter.split_text(text):
    print(sent)

Selecting the “right” amount of information to include in a summary is a difficult task. 
A good
A good summary should be detailed and entity-centric without being overly dense and hard to follow.
to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with
with what we refer to as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an initial
an initial entity-sparse summary before iteratively incorporating missing salient entities without
without increasing the length. Summaries genera
Selecting the “right” amount of information to include in a summary is a difficult task.
A good summary should be detailed and entity-centric without being overly dense and hard to follow.
follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what
with what we refer to as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an
an initial entity-sparse summary before iteratively incorporating missing s

- Attempts to split the given document sequentially using the specified list of separators.
- Attempts splitting in order until the chunks are sufficiently small. The default list is ["\n\n", "\n", " ", ""].
- This generally keeps paragraphs (and then sentences, and then words) together for as long as possible, since those tend to be the most semantically related pieces of text.
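The fallback order described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and overlap handling and separator retention are deliberately left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort (empty separator or none left): cut at fixed character offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part is still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows."
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

The first paragraph fits and is kept whole; the second is too long for the paragraph and line separators, so it falls through to splitting on spaces.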


In [28]:
# Check the default separators specified in recursive_text_splitter.
recursive_text_splitter._separators

['\n\n', '\n', ' ', '']

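### Semantic Similarity

Splits text based on semantic similarity.

Source: [SemanticChunker](https://python.langchain.com/api_reference/experimental/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html)

At a high level, the process splits the text into sentences, groups them into windows of three sentences, and then merges the windows that are similar in embedding space.

**Details:**

### Core Principles of SemanticChunker

**1. Semantics-driven splitting strategy**
- Does not rely on character counts or fixed separators
- Groups content based on **semantic similarity**
- Aims to keep each chunk topically cohesive

**2. Three-stage processing flow**

**Stage 1: Sentence splitting**
```python
# Break the original text into individual sentences (illustrative values)
sentences = ["sentence 1", "sentence 2", "sentence 3", "sentence 4", "sentence 5"]
```

**Stage 2: Three-sentence grouping**
```python
# Build sliding windows of three consecutive sentences
windows = [sentences[i:i + 3] for i in range(len(sentences) - 2)]
# ["sentence 1", "sentence 2", "sentence 3"]  # window 1
# ["sentence 2", "sentence 3", "sentence 4"]  # window 2
# ["sentence 3", "sentence 4", "sentence 5"]  # window 3
```

**Stage 3: Semantic merging**
```python
# Pseudocode: embed, cosine_similarity, threshold, and merge_chunks are placeholders.
# Compute similarity between neighboring windows in vector space
embedding_1 = embed("sentences 1-2-3")
embedding_2 = embed("sentences 2-3-4")
similarity = cosine_similarity(embedding_1, embedding_2)

# Decide whether to merge based on a similarity threshold
if similarity > threshold:
    merge_chunks()
```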

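**3. Key advantages**

**Semantic coherence**
- Keeps related content together in the same chunk
- Avoids cutting a topic at a chunk boundary
- Can improve the accuracy of RAG retrieval

**Adaptive splitting**
- Chunk size adjusts dynamically to the content
- Complex topics may produce larger chunks
- Simple content forms smaller chunks

**Context preservation**
- The overlapping three-sentence windows carry context across neighboring sentences when similarity is computed
- Important concepts are less likely to be lost at a boundary
- Gives the language model more coherent input

**4. Comparison with traditional methods**

| Feature | SemanticChunker | RecursiveCharacterTextSplitter |
|------|----------------|-------------------------------|
| Splitting basis | Semantic similarity | Character count / separators |
| Chunk size | Dynamic (content-driven) | Bounded (controlled by parameters) |
| Topic integrity | High | Medium |
| Computational cost | High (requires embedding computation) | Low |

**5. Suitable scenarios**

**High-quality RAG systems**
- Question-answering systems that need precise topic matching
- Retrieval over complex technical documentation
- Semantic search for academic research

**Multi-topic documents**
- Long documents that contain several unrelated topics
- Topic grouping for collections of news articles
- Content organization for enterprise knowledge bases

**6. Implementation considerations**

**Choosing an embedding model**
- A high-quality sentence embedding model is required
- Consider language characteristics and domain fit
- Balance accuracy against computational cost

**Tuning the similarity threshold**
- Too high: produces many small chunks and may lose context
- Too low: produces overly large chunks and reduces retrieval precision
- Needs to be tuned for the specific application

Although semantic splitting costs more to compute, it can improve results in RAG applications that depend on high-quality retrieval.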
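The three stages can be tried end to end without an embedding API. The sketch below is a simplified illustration and not the `SemanticChunker` implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and the names `semantic_chunks`, `threshold`, and `buffer_size` are assumptions chosen for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.3, buffer_size=1):
    # Stage 1: split into sentences.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Stage 2: embed each sentence together with its neighbors
    # (buffer_size=1 gives windows of up to three sentences).
    windows = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [embed(w) for w in windows]
    # Stage 3: keep appending while neighboring windows stay similar.
    chunks, current = [], sentences[:1]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) >= threshold:
            current.append(sentences[i])
        else:
            chunks.append(" ".join(current))
            current = [sentences[i]]
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "Cats purr. Cats sleep. Stocks fell. Stocks rose."
for chunk in semantic_chunks(demo, threshold=0.3, buffer_size=0):
    print(chunk)
```

With `buffer_size=0` each sentence is compared on its own, which makes the topic change between the second and third sentence easy to see; larger buffers smooth the comparison because neighboring windows share sentences.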

In [29]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Create a SemanticChunker.
semantic_text_splitter = SemanticChunker(OpenAIEmbeddings(model="text-embedding-3-small"), add_start_index=True)

In [30]:
# Load the full text of the "Chain of Density" paper.
with open("data/chain-of-density.txt", "r", encoding="utf-8") as f:
    text = f.read()

for sent in semantic_text_splitter.split_text(text):
    print(sent)
    print("===" * 20)

Selecting the “right” amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries generated by CoD are more abstractive, exhibit more fusion, and have less of a lead bias than GPT-4 summaries generated by a vanilla prompt. We conduct a human preference study on 100 CNN DailyMail articles and find that that humans prefer GPT-4 summaries that are more dense than those generated by a vanilla prompt and almost as dense as human written summaries. Qualitative analysis supports the notion that there exists a tradeoff between infor-mativeness and readability. 500 annotated CoD summaries, a

## Step 3: Embedding

- [Link to official documentation - Embedding](https://python.langchain.com/docs/integrations/text_embedding)


### Paid Embeddings (OpenAI)

It uses OpenAI's embedding model, which is a paid service.

In [31]:
from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings

# Step 3: Create Embeddings & Vectorstore
# Generate the vector store.
vectorstore = FAISS.from_documents(documents=splits, embedding=OpenAIEmbeddings(model="text-embedding-3-small"))

The table below lists the embedding models offered by ```OpenAI```.

If no model is specified, the default is ```text-embedding-ada-002```.


| MODEL                  | ROUGH PAGES PER DOLLAR | EXAMPLE PERFORMANCE ON MTEB EVAL |
| ---------------------- | ---------------------- | -------------------------------- |
| text-embedding-3-small | 62,500                 | 62.3%                            |
| text-embedding-3-large | 9,615                  | 64.6%                            |
| text-embedding-ada-002 | 12,500                 | 61.0%                            |
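To make the "rough pages per dollar" column concrete, the following sketch turns it into a cost estimate for a corpus. The figures are copied from the table above; the helper name `estimated_cost_usd` and the 10,000-page corpus are assumptions for illustration, and actual pricing may have changed.

```python
# Rough pages per dollar, taken from the table above.
PAGES_PER_DOLLAR = {
    "text-embedding-3-small": 62_500,
    "text-embedding-3-large": 9_615,
    "text-embedding-ada-002": 12_500,
}


def estimated_cost_usd(model: str, pages: int) -> float:
    """Approximate embedding cost in USD for a given number of pages."""
    return pages / PAGES_PER_DOLLAR[model]


for name in PAGES_PER_DOLLAR:
    print(f"{name}: about ${estimated_cost_usd(name, 10_000):.2f} for 10,000 pages")
```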


In [32]:
vectorstore = FAISS.from_documents(
    documents=splits, embedding=OpenAIEmbeddings(model="text-embedding-3-small")
)

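### Free, Open-Source Embedding Models
1. HuggingFaceEmbeddings (default model: sentence-transformers/all-mpnet-base-v2)
2. FastEmbedEmbeddings

**Note**
- When using an embedding model, verify that it supports the language you are working with.

**Details:**

### Choosing an Open-Source Embedding Model

**1. HuggingFaceEmbeddings**

**Default model characteristics:**
- **all-mpnet-base-v2**: a high-quality, general-purpose sentence embedding model
- **Language support**: optimized mainly for English; performance on other languages is lower
- **Dimensions**: 768-dimensional vectors
- **Use cases**: general text, question answering, semantic search

**Configuration options:**
```python
from langchain_huggingface import HuggingFaceEmbeddings

# Use the default model
embeddings = HuggingFaceEmbeddings()

# Use a custom model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # a lighter-weight option
    model_kwargs={'device': 'cpu'},  # or 'cuda' to use a GPU
    encode_kwargs={'normalize_embeddings': True}
)
```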

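**2. FastEmbedEmbeddings**

**Key advantages:**
- **High performance**: embedding computation optimized for speed
- **Lightweight**: smaller model size and memory footprint
- **Quick to deploy**: simplified installation and configuration

**Use cases:**
- Applications that need fast responses
- Resource-constrained environments
- Large-scale text processing

### Language Support Considerations

**1. Alternatives for Chinese**
```python
from langchain_huggingface import HuggingFaceEmbeddings

# A model optimized for Chinese
chinese_embeddings = HuggingFaceEmbeddings(
    model_name="shibing624/text2vec-base-chinese"
)

# A multilingual model
multilingual_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
```

**2. Model comparison**

| Model | Language support | Approx. model size | Quality | Recommended use |
|------|----------|----------|------|----------|
| all-mpnet-base-v2 | English | ~420MB | High | English RAG systems |
| all-MiniLM-L6-v2 | English | ~80MB | Medium | Lightweight English applications |
| paraphrase-multilingual-* | Multilingual | ~220MB | Medium-high | Multilingual applications |
| text2vec-base-chinese | Chinese | ~400MB | High | Chinese RAG systems |

**3. Best practice recommendations**

**Language matching**
- Make sure the embedding model matches the language of the documents
- Plan how to handle documents that mix several languages
- Test different models on your specific domain

**Performance optimization**
- Use GPU acceleration if available
- Process texts in batches to improve throughput
- Consider model quantization to reduce memory usage

**Quality validation**
- Test embedding quality on real data
- Compare the retrieval performance of different models
- Choose the model that fits the application's requirements

**4. Deployment notes**
- **First load time**: downloading and initializing the model can take a while
- **Memory requirements**: make sure the system has enough memory to load the model
- **Network dependency**: the first use requires a network connection to download the model
- **Version compatibility**: watch for compatibility issues between library versions

Choosing a suitable open-source embedding model is a key step in building an efficient RAG system, and it means balancing quality, resource consumption, and language support.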
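One detail from the configuration above that is easy to overlook is `normalize_embeddings`. As a minimal sketch in plain Python (the two-dimensional vectors are made-up values, not real embeddings), the code below shows that once vectors are scaled to unit length, a plain dot product gives the same number as cosine similarity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]  # made-up example vectors
print("cosine similarity:      ", cosine(a, b))
print("dot product of unit vecs:", dot(l2_normalize(a), l2_normalize(b)))
```

This is why normalized embeddings work well with vector stores that rank by inner product or Euclidean distance: on unit vectors those orderings agree with cosine similarity.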

In [33]:
from langchain_huggingface import HuggingFaceEmbeddings

# Generate the vector store. (Default model: sentence-transformers/all-mpnet-base-v2)
vectorstore = FAISS.from_documents(
    documents=splits, embedding=HuggingFaceEmbeddings()
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [34]:
# %pip install fastembed

In [35]:
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

vectorstore = FAISS.from_documents(documents=splits, embedding=FastEmbedEmbeddings())

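## Step 4: Create Vectorstore

Creating a vector store means generating vector embeddings from documents and storing them in a database.

**Details:**

### Core Concepts of a Vector Store

**1. Vectorization flow**
```python
# document text -> embedding model -> numeric vector -> vector database
```

**2. Components of a vector store**

**Document content**
- The original text (page_content)
- Associated metadata (metadata)
- Document source and location information

**Vector representation**
- High-dimensional numeric vectors (typically 512-1536 dimensions)
- A mathematical representation of semantic information
- Enables fast similarity search

**Index structure**
- Efficient search algorithms (such as HNSW and IVF)
- Optimized similarity computation
- Support for distributed storage

### Common Vector Database Options

**1. In-memory (development and testing)**
```python
# FAISS - high-performance vector search
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(docs, embeddings)

# Chroma - a lightweight option
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(docs, embeddings)
```

**2. Persistent (production)**
```python
# Pinecone - managed service (requires an API key and an existing index)
from langchain_community.vectorstores import Pinecone
vectorstore = Pinecone.from_documents(docs, embeddings, index_name="my-index")

# Weaviate - open-source option (requires a running Weaviate instance;
# connection settings are omitted here)
from langchain_community.vectorstores import Weaviate
vectorstore = Weaviate.from_documents(docs, embeddings)
```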

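### Key Steps in the Creation Process

**1. Document preprocessing**
- Make sure the document format is correct
- Validate document content and metadata
- Handle special characters and encoding issues

**2. Batch vectorization**
```python
# Add documents in batches; the vector store embeds each batch as it is added,
# so a separate embed_documents() call is not needed.
# Assumes `documents` and an existing `vectorstore` are already defined.
batch_size = 100
for i in range(0, len(documents), batch_size):
    batch = documents[i:i+batch_size]
    vectorstore.add_documents(batch)
```

**3. Index optimization**
- Choose a suitable index algorithm
- Tune index parameters to improve search performance
- Balance memory and disk usage

### Performance and Scalability Considerations

**1. Storage capacity planning**
- Estimate the storage space the vectors will need
- Plan for growth and scaling
- Consider data backup and disaster recovery

**2. Query performance optimization**
- Tune similarity search parameters
- Use caching to avoid repeated computation
- Use an appropriate distance metric

**3. Distributed deployment**
- Choose a database that supports horizontal scaling
- Design for load balancing and high availability
- Plan for data consistency and synchronization

### Quality Assurance Measures

**1. Vector quality check**
```python
# Test vector generation
test_query = "sample query"
query_vector = embeddings.embed_query(test_query)
print(f"Vector dimension: {len(query_vector)}")
print(f"Value range: {min(query_vector)} to {max(query_vector)}")
```

**2. Search accuracy test**
```python
# Inspect the retrieval results
results = vectorstore.similarity_search(test_query, k=5)
for i, doc in enumerate(results):
    print(f"Result {i+1}: {doc.page_content[:100]}...")
```

The quality of the vector store directly affects how well a RAG system retrieves, so each step of the creation process deserves care.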
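To show what a vector store does underneath, here is a minimal in-memory sketch. It is not how FAISS or Chroma are implemented: the class name `TinyVectorStore` is hypothetical, the search is brute force rather than indexed, and `toy_embed` is a bag-of-words stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: bag-of-words counts (a real model returns dense floats)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Stores (vector, text, metadata) rows and searches them by brute force."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []

    def add_texts(self, texts, metadatas=None):
        metadatas = metadatas or [{} for _ in texts]
        for text, meta in zip(texts, metadatas):
            self.rows.append((self.embed(text), text, meta))

    def similarity_search(self, query, k=4):
        q = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:k]]


texts = [
    "FAISS is a vector index library.",
    "Seals swim in the ocean.",
    "Chroma stores vectors on disk.",
]
store = TinyVectorStore(toy_embed)
store.add_texts(texts, metadatas=[{"id": i} for i in range(len(texts))])
print(store.similarity_search("vector index", k=1))
```

Real vector stores replace the linear scan with an index such as HNSW or IVF so that search stays fast as the collection grows.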

In [36]:
from langchain_community.vectorstores import FAISS

# Apply FAISS DB
vectorstore = FAISS.from_documents(documents=splits, embedding=OpenAIEmbeddings(model="text-embedding-3-small"))

In [37]:
from langchain_community.vectorstores import Chroma

# Apply Chroma DB
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings(model="text-embedding-3-small"))

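## Step 5: Create Retriever

A retriever is an interface that returns documents given an unstructured query.

A retriever does not need to store documents; it only returns (or retrieves) them.

- [Link to official documentation - Retriever](https://python.langchain.com/docs/integrations/retrievers/)

A **retriever** is created by calling the ```as_retriever()``` method on the generated vector store; calling ```invoke()``` on the retriever then runs a query.

**Additional notes:**

### Core Functions of a Retriever

**1. Query processing**
- Receives a natural-language query
- Converts it into a vector representation
- Searches the vector space for similar documents

**2. Returning documents**
```python
# Basic retriever
retriever = vectorstore.as_retriever()

# Retriever with parameters
retriever = vectorstore.as_retriever(
    search_type="similarity",  # search type
    search_kwargs={"k": 3}     # return the 3 most relevant documents
)
```

**3. Search strategy options**
- **similarity**: similarity-based search
- **mmr**: maximal marginal relevance, which reduces redundancy
- **similarity_score_threshold**: applies a similarity threshold

This interface lets a retriever work flexibly with different vector store backends.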

### Similarity Retrieval

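- The default setting is ```similarity```, which returns the documents whose embeddings are closest to the query. The underlying metric (for example cosine similarity or L2 distance) depends on the vector store.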


In [38]:
question = "Why did OpenAI and Scarlett Johansson have a conflict?"

retriever = vectorstore.as_retriever(search_type="similarity")
search_result = retriever.invoke(question)
print(search_result)

[Document(metadata={'source': 'https://www.forbes.com/sites/rashishrivastava/2024/05/21/the-prompt-scarlett-johansson-vs-openai/'}, page_content="ForbesInnovationEditors' PickThe Prompt: Scarlett Johansson Vs OpenAIPlus AI-generated kids draw predators on TikTok and Instagram. \nShare to FacebookShare to TwitterShare to Linkedin“I was shocked, angered and in disbelief,” Scarlett Johansson said about OpenAI's Sky voice for ChatGPT that sounds similar to her own.FilmMagic\nThe Prompt is a weekly rundown of AI’s buzziest startups, biggest breakthroughs, and business deals. To get it in your inbox, subscribe here.\n\n\nWelcome back to The Prompt.\n\nScarlett Johansson’s lawyers have demanded that OpenAI take down a voice for ChatGPT that sounds much like her own after she’d declined to work with the company to create it. The actress said in a statement provided to Forbes that her lawyers have asked the AI company to detail the “exact processes” it used to create the voice, which sounds eer

The ```similarity_score_threshold``` search type returns only the results whose similarity score is at least ```score_threshold```.

In [39]:
question = "Why did OpenAI and Scarlett Johansson have a conflict?"

retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.8}
)
search_result = retriever.invoke(question)
print(search_result)



[]
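An empty list like the one above most likely means that no chunk reached the 0.8 threshold. The sketch below shows the filtering idea in isolation; the scores are made-up values for illustration, not scores from this notebook, and `filter_by_threshold` is a hypothetical helper rather than a LangChain function.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep only the documents whose score is at least score_threshold."""
    return [doc for doc, score in scored_docs if score >= score_threshold]


# Made-up (document, score) pairs for illustration only.
scored = [("doc A", 0.62), ("doc B", 0.41), ("doc C", 0.35)]

print(filter_by_threshold(scored, 0.8))  # a strict threshold can filter out everything
print(filter_by_threshold(scored, 0.4))  # a looser threshold keeps the best matches
```

If a threshold returns nothing, inspecting the raw scores first and lowering the threshold is usually a more informative next step than changing the query.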


Search using maximal marginal relevance (```mmr```), which balances relevance to the query against diversity among the returned documents.


In [40]:
question = "Why did OpenAI and Scarlett Johansson have a conflict?"

retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 2})
search_result = retriever.invoke(question)
print(search_result)



[Document(metadata={'source': 'https://www.forbes.com/sites/rashishrivastava/2024/05/21/the-prompt-scarlett-johansson-vs-openai/'}, page_content="ForbesInnovationEditors' PickThe Prompt: Scarlett Johansson Vs OpenAIPlus AI-generated kids draw predators on TikTok and Instagram. \nShare to FacebookShare to TwitterShare to Linkedin“I was shocked, angered and in disbelief,” Scarlett Johansson said about OpenAI's Sky voice for ChatGPT that sounds similar to her own.FilmMagic\nThe Prompt is a weekly rundown of AI’s buzziest startups, biggest breakthroughs, and business deals. To get it in your inbox, subscribe here.\n\n\nWelcome back to The Prompt.\n\nScarlett Johansson’s lawyers have demanded that OpenAI take down a voice for ChatGPT that sounds much like her own after she’d declined to work with the company to create it. The actress said in a statement provided to Forbes that her lawyers have asked the AI company to detail the “exact processes” it used to create the voice, which sounds eer
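The selection rule behind the `mmr` search type can be sketched as follows. This is a simplified illustration rather than LangChain's code: the two-dimensional vectors are made up, and the parameter name `lambda_mult` is used here as the relevance/diversity weight (1.0 means pure relevance).

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def unit(degrees):
    """Unit vector at the given angle; handy for building toy embeddings."""
    rad = math.radians(degrees)
    return (math.cos(rad), math.sin(rad))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = unit(0)
docs = [unit(20), unit(22), unit(-30)]  # docs 0 and 1 are near-duplicates

print("MMR picks:       ", mmr(query, docs, k=2, lambda_mult=0.5))
print("Relevance only:  ", mmr(query, docs, k=2, lambda_mult=1.0))
```

With relevance alone the two near-duplicate vectors win; with the diversity penalty the second pick moves to the document that adds something new.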

### Create a variety of queries
With ```MultiQueryRetriever```, you can generate similar questions with equivalent meanings based on the original query. This helps diversify question expressions, which can enhance search performance.

In [41]:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

question = "Why did OpenAI and Scarlett Johansson have a conflict?"

llm = ChatOpenAI(temperature=0, model="gpt-4o-mini")

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=llm
)

In [42]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [43]:
unique_docs = retriever_from_llm.invoke(question)
len(unique_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What was the nature of the disagreement between OpenAI and Scarlett Johansson?  ', 'Can you explain the reasons behind the conflict involving OpenAI and Scarlett Johansson?  ', 'What led to the tensions between OpenAI and Scarlett Johansson?']


4
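The count above is the number of unique documents left after the results of all generated queries are combined. A minimal sketch of that merge step, assuming each query returns a list of chunk identifiers (the identifiers and the helper name `unique_union` are made up for illustration):

```python
def unique_union(result_lists):
    """Merge several result lists, keeping the first occurrence of each item."""
    seen, merged = set(), []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical results for three rephrasings of the same question.
results_per_query = [
    ["chunk-1", "chunk-2", "chunk-3"],
    ["chunk-2", "chunk-3", "chunk-4"],
    ["chunk-1", "chunk-4"],
]
merged = unique_union(results_per_query)
print(merged, len(merged))
```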

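### Ensemble Retriever
**BM25 Retriever + Embedding-based Retriever**

- ```BM25 Retriever``` (keyword search, sparse retriever): based on TF-IDF, taking term frequency and document-length normalization into account.
- ```Embedding-based Retriever``` (contextual search, dense retriever): converts text into embedding vectors and retrieves documents by vector similarity (for example cosine similarity or dot product). This reflects the semantic similarity of words.
- ```Ensemble Retriever```: combines the BM25 and embedding-based retrievers, pairing the term-frequency signal of keyword search with the semantic similarity of contextual search.

**Note**

TF-IDF (Term Frequency - Inverse Document Frequency): TF-IDF rates a word as highly important when it appears frequently in a specific document, and as less important when it appears frequently across all documents.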
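A small worked example may make the TF-IDF note concrete. The sketch below uses the plain, unsmoothed form (term frequency times the log of inverse document frequency); production implementations, including BM25, add smoothing and length normalization, and the function name `tf_idf` is an assumption for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using unsmoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


docs = [
    "the seal swims in the ocean",
    "the seal claps",
    "the envelope needs a seal",
]
weights = tf_idf(docs)
for term in ("the", "seal", "ocean"):
    print(term, round(weights[0][term], 4))
```

Words that occur in every document ("the", "seal") get a weight of zero, while a word unique to one document ("ocean") stands out.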

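**Key points:**

### Core Advantages of Ensemble Retrieval

**1. Complementary retrieval strategies**
- **BM25**: good at exact keyword matching, including proper nouns and terminology
- **Embedding retrieval**: captures semantic relationships, including synonyms and related concepts
- **Combined effect**: uses the strengths of both to improve retrieval precision and recall

**2. How TF-IDF rates importance**
- **Frequent in a specific document**: the word is important to that document
- **Frequent across all documents**: the word is treated as common and its importance drops
- **In practice**: "artificial intelligence" matters in an AI paper, while function words such as "the" and "is" carry little weight

**3. Practical use cases**
- **Terminology lookup**: BM25 matches technical terms exactly
- **Concept-level retrieval**: the embedding model captures related concepts
- **Hybrid queries**: satisfy exact matching and semantic understanding at the same time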
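One common way to combine two ranked lists is weighted Reciprocal Rank Fusion, which is the approach `EnsembleRetriever` is documented to use. The sketch below is a simplified version under stated assumptions: the constant `c=60` and the function name `weighted_rrf` are choices made for this example, and the two input rankings are copied from the BM25 and FAISS outputs of the "securely sealed" query shown later in this section.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document earns weight / (rank + c) per list."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (rank + c)
    # sorted() is stable, so tied documents keep their first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = [
    "The seal is clapping its flippers.",
    "Every official document requires a seal to authenticate it.",
    "Make sure the envelope has a proper seal before sending it.",
    "We saw a seal swimming in the ocean.",
]
faiss_ranking = [
    "Make sure the envelope has a proper seal before sending it.",
    "Every official document requires a seal to authenticate it.",
    "The seal is clapping its flippers.",
    "We saw a seal swimming in the ocean.",
]

fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.5, 0.5])
for i, doc in enumerate(fused, start=1):
    print(f"[{i}] {doc}")
```

Because only ranks are used, the two retrievers do not need comparable score scales, which is what makes this fusion convenient for mixing sparse and dense retrieval.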

In [44]:
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

In [45]:
doc_list = [
    "We saw a seal swimming in the ocean.",
    "The seal is clapping its flippers.",
    "Make sure the envelope has a proper seal before sending it.",
    "Every official document requires a seal to authenticate it.",
]

# initialize the bm25 retriever and faiss retriever
bm25_retriever = BM25Retriever.from_texts(doc_list)
bm25_retriever.k = 4

faiss_vectorstore = FAISS.from_texts(doc_list, OpenAIEmbeddings(model="text-embedding-3-small"))
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 4})

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

In [46]:
def pretty_print(docs):
    for i, doc in enumerate(docs):
        print(f"[{i+1}] {doc.page_content}")

In [47]:
sample_query = "The seal rested on a rock."
print(f"[Query]\n{sample_query}\n")
relevant_docs = bm25_retriever.invoke(sample_query)
print("[BM25 Retriever]")
pretty_print(relevant_docs)
print("===" * 20)
relevant_docs = faiss_retriever.invoke(sample_query)
print("[FAISS Retriever]")
pretty_print(relevant_docs)
print("===" * 20)
relevant_docs = ensemble_retriever.invoke(sample_query)
print("[Ensemble Retriever]")
pretty_print(relevant_docs)

[Query]
The seal rested on a rock.

[BM25 Retriever]
[1] The seal is clapping its flippers.
[2] We saw a seal swimming in the ocean.
[3] Every official document requires a seal to authenticate it.
[4] Make sure the envelope has a proper seal before sending it.
[FAISS Retriever]
[1] The seal is clapping its flippers.
[2] We saw a seal swimming in the ocean.
[3] Every official document requires a seal to authenticate it.
[4] Make sure the envelope has a proper seal before sending it.
[Ensemble Retriever]
[1] The seal is clapping its flippers.
[2] We saw a seal swimming in the ocean.
[3] Every official document requires a seal to authenticate it.
[4] Make sure the envelope has a proper seal before sending it.


In [48]:
sample_query = "Ensure the package is securely sealed before handing it to the courier."
print(f"[Query]\n{sample_query}\n")
relevant_docs = bm25_retriever.invoke(sample_query)
print("[BM25 Retriever]")
pretty_print(relevant_docs)
print("===" * 20)
relevant_docs = faiss_retriever.invoke(sample_query)
print("[FAISS Retriever]")
pretty_print(relevant_docs)
print("===" * 20)
relevant_docs = ensemble_retriever.invoke(sample_query)
print("[Ensemble Retriever]")
pretty_print(relevant_docs)

[Query]
Ensure the package is securely sealed before handing it to the courier.

[BM25 Retriever]
[1] The seal is clapping its flippers.
[2] Every official document requires a seal to authenticate it.
[3] Make sure the envelope has a proper seal before sending it.
[4] We saw a seal swimming in the ocean.
[FAISS Retriever]
[1] Make sure the envelope has a proper seal before sending it.
[2] Every official document requires a seal to authenticate it.
[3] The seal is clapping its flippers.
[4] We saw a seal swimming in the ocean.
[Ensemble Retriever]
[1] The seal is clapping its flippers.
[2] Make sure the envelope has a proper seal before sending it.
[3] Every official document requires a seal to authenticate it.
[4] We saw a seal swimming in the ocean.


In [49]:
sample_query = "The certificate must bear an official seal to be considered valid."
print(f"[Query]\n{sample_query}\n")
relevant_docs = bm25_retriever.invoke(sample_query)
print("[BM25 Retriever]")
pretty_print(relevant_docs)
print("===" * 20)
relevant_docs = faiss_retriever.invoke(sample_query)
print("[FAISS Retriever]")
pretty_print(relevant_docs)
print("===" * 20)
relevant_docs = ensemble_retriever.invoke(sample_query)
print("[Ensemble Retriever]")
pretty_print(relevant_docs)

[Query]
The certificate must bear an official seal to be considered valid.

[BM25 Retriever]
[1] Every official document requires a seal to authenticate it.
[2] The seal is clapping its flippers.
[3] We saw a seal swimming in the ocean.
[4] Make sure the envelope has a proper seal before sending it.
[FAISS Retriever]
[1] Every official document requires a seal to authenticate it.
[2] Make sure the envelope has a proper seal before sending it.
[3] The seal is clapping its flippers.
[4] We saw a seal swimming in the ocean.
[Ensemble Retriever]
[1] Every official document requires a seal to authenticate it.
[2] The seal is clapping its flippers.
[3] Make sure the envelope has a proper seal before sending it.
[4] We saw a seal swimming in the ocean.


In [50]:
sample_query = "animal"

print(f"[Query]\n{sample_query}\n")
relevant_docs = bm25_retriever.invoke(sample_query)
print("[BM25 Retriever]")
pretty_print(relevant_docs)
print("===" * 20)
relevant_docs = faiss_retriever.invoke(sample_query)
print("[FAISS Retriever]")
pretty_print(relevant_docs)
print("===" * 20)
relevant_docs = ensemble_retriever.invoke(sample_query)
print("[Ensemble Retriever]")
pretty_print(relevant_docs)

[Query]
animal

[BM25 Retriever]
[1] Every official document requires a seal to authenticate it.
[2] Make sure the envelope has a proper seal before sending it.
[3] The seal is clapping its flippers.
[4] We saw a seal swimming in the ocean.
[FAISS Retriever]
[1] We saw a seal swimming in the ocean.
[2] The seal is clapping its flippers.
[3] Every official document requires a seal to authenticate it.
[4] Make sure the envelope has a proper seal before sending it.
[Ensemble Retriever]
[1] Every official document requires a seal to authenticate it.
[2] We saw a seal swimming in the ocean.
[3] The seal is clapping its flippers.
[4] Make sure the envelope has a proper seal before sending it.


## Step 6: Create Prompt

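Prompt engineering plays a key role in getting the desired output from the given data (```context```).

[Tip 1]

1. If important information is missing from the results returned by the ```retriever```, you should modify the ```retriever``` logic.
2. If the ```retriever``` results contain enough information but the LLM fails to extract the key information or does not produce output in the desired format, you should adjust the prompt.

[Tip 2]

1. LangSmith's **hub** contains many verified prompts.
2. Using these verified prompts, or modifying them slightly, can save cost and time.

- https://smith.langchain.com/hub/search?q=rag

**Key points:**

### Diagnostic Strategy for Prompt Engineering

**1. Locating the problem**
- **Retrieval problem**: relevant documents are not found → adjust the retrieval strategy
- **Extraction problem**: the documents are there but the answer is inaccurate → improve the prompt design

**2. Value of the LangSmith Hub**
- **Verified**: prompt templates tested and refined by the community
- **Quick start**: avoids designing prompts from scratch
- **Cost-effective**: reduces trial-and-error time and API call costs

**3. Prompt optimization priorities**
- Specify the output format explicitly
- Provide a concrete role
- Include guidance for handling uncertain cases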
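To see how retrieved context ends up in a prompt, the sketch below fills a template with plain string formatting. The template text mirrors the `rlm/rag-prompt` output shown in this step; the helper name `build_prompt` and the sample question and documents are assumptions for illustration, and a real chain would use a LangChain prompt object instead.

```python
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def build_prompt(question, docs):
    """Join retrieved chunks with blank lines and insert them into the template."""
    context = "\n\n".join(docs)
    return RAG_TEMPLATE.format(question=question, context=context)


# Hypothetical retrieved chunks.
retrieved = ["Seoul is the capital of South Korea.", "Busan is a port city."]
print(build_prompt("Where is the capital of South Korea?", retrieved))
```

Printing the final prompt like this is a quick way to tell a retrieval problem (the needed fact is missing from the context) from an extraction problem (the fact is present but the answer is still wrong).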

In [51]:
from langchain import hub

In [52]:
prompt = hub.pull("rlm/rag-prompt")
prompt.pretty_print()


You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:


## Step 7: Create LLM

Select one of the OpenAI models:

- ```gpt-4o``` : OpenAI GPT-4o model
- ```gpt-4o-mini``` : OpenAI GPT-4o-mini model

For detailed pricing information, please refer to the [OpenAI API Model List / Pricing](https://openai.com/api/pricing/)

In [53]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(temperature=0, model="gpt-4o-mini")

You can check token usage in the following way.

In [54]:
from langchain_community.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = model.invoke("Where is the capital of South Korea?")
print(cb)

Tokens Used: 24
	Prompt Tokens: 15
		Prompt Tokens Cached: 0
	Completion Tokens: 9
		Reasoning Tokens: 0
Successful Requests: 1
Total Cost (USD): $7.65e-06
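The reported total can be checked by hand. The sketch below assumes per-token rates of $0.15 per million prompt tokens and $0.60 per million completion tokens for `gpt-4o-mini`; these rates are an assumption that happens to reproduce the figure above, so check the pricing page for current values.

```python
# Assumed USD prices per one million tokens (verify against current pricing).
PRICE_PER_1M = {"prompt": 0.15, "completion": 0.60}


def request_cost(prompt_tokens, completion_tokens):
    """Cost in USD of one request, given its prompt and completion token counts."""
    return (
        prompt_tokens * PRICE_PER_1M["prompt"]
        + completion_tokens * PRICE_PER_1M["completion"]
    ) / 1_000_000


# Token counts taken from the callback output above.
print(f"Total Cost (USD): ${request_cost(15, 9):.2e}")
```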


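### Using Hugging Face

You need a Hugging Face token to access LLMs on Hugging Face.

You can easily download and use the open-source models available on Hugging Face.

You can also check the open-source leaderboard, where model performance improves daily, at the link below:

- [HuggingFace LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

**Note**

The free Hugging Face API has a 10GB size limit.
For example, the microsoft/Phi-3-mini-4k-instruct model is 11GB and may not be accessible through the free API.

Choose one of the following options:

1. Option: Use Hugging Face Inference Endpoints

Enable Inference Endpoints through a paid plan to run inference on large models.

2. Option: Run the model locally

Use the transformers library to run the microsoft/Phi-3-mini-4k-instruct model in a local environment (a GPU is recommended).

3. Option: Use a smaller model.

Reduce the model size to one supported by the free API and run it.

**Key points:**

### Hugging Face Usage Strategy

**1. Free API limits**
- **10GB limit**: many high-performing models exceed it
- **Inference speed**: the free service may be slower
- **Usage quota**: there is a monthly limit on calls

**2. Model selection advice**
- **Small, efficient models**: such as microsoft/DialoGPT-medium (1.4GB)
- **Multilingual models**: consider the languages you need to support
- **Task-specific models**: choose models optimized for the task at hand

**3. Local deployment considerations**
- **Hardware requirements**: GPU memory and compute capacity
- **Environment setup**: dependencies such as CUDA and PyTorch
- **Cost-effectiveness**: long-term use may be cheaper than a paid API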

In [55]:
# Creating a HuggingFaceEndpoint object
from langchain_huggingface import HuggingFaceEndpoint

repo_id = "microsoft/Phi-3-mini-4k-instruct"

hugging_face_llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    max_new_tokens=256,
    temperature=0.1,
)


In [56]:
hugging_face_llm.invoke("Where is the capital of South Korea?")

'\n\n# Answer\nThe capital of South Korea is Seoul.'

## RAG Template Experiment
This template is a structure for implementing a Retrieval-Augmented Generation (RAG) workflow.

In [57]:
# Step 1: Load Documents
# Load the documents, split them into chunks, and index them.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Load the PDF file. Enter the file path.
file_path = "data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf"
loader = PyPDFLoader(file_path=file_path)

# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

split_docs = loader.load_and_split(text_splitter=text_splitter)

# Step 3, 4: Embedding & Create Vectorstore
embedding = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(documents=split_docs, embedding=embedding)

# Step 5: Create Retriever
# Search for documents that match the user's query.

# Retrieve the top K documents with the highest similarity.
k = 3

# Initialize the (Sparse) BM25 retriever and (Dense) FAISS retriever.
bm25_retriever = BM25Retriever.from_documents(split_docs)
bm25_retriever.k = k

# Reuse the vector store built in Steps 3-4 instead of embedding the documents a second time.
faiss_retriever = vectorstore.as_retriever(search_kwargs={"k": k})

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

# Step 6: Create Prompt

prompt = hub.pull("rlm/rag-prompt")

# Step 7: Create LLM
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)


def format_docs(docs):
    # Combine the retrieved document results into a single paragraph.
    return "\n\n".join(doc.page_content for doc in docs)


# Step 8: Create Chain
rag_chain = (
    {"context": ensemble_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Run Chain: Input a query about the document and output the answer.

question = "Which region's approach to artificial intelligence is the focus of this document?"
response = rag_chain.invoke(question)

# Get Output
print(f"PDF Path: {file_path}")
print(f"Number of documents: {len(split_docs)}")
print("===" * 20)
print(f"[HUMAN]\n{question}\n")
print(f"[AI]\n{response}")

PDF Path: data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf
Number of documents: 86
[HUMAN]
Which region's approach to artificial intelligence is the focus of this document?

[AI]
The focus of this document is on the European approach to artificial intelligence. It discusses the strategies and policies implemented by the European Commission and EU Member States to enhance AI development and governance in Europe. The document emphasizes the importance of trust, data governance, and collaboration in fostering AI innovation within the region.


Document: A European Approach to Artificial Intelligence - A Policy Perspective.pdf

- LangSmith: https://smith.langchain.com/public/0951c102-de61-482b-b42a-6e7d78f02107/r


In [58]:
question = "Which region's approach to artificial intelligence is the focus of this document?"
response = rag_chain.invoke(question)
print(response)


The focus of this document is on the European approach to artificial intelligence. It discusses the strategies and policies implemented by the European Commission and EU Member States to enhance AI development and governance in Europe. The document emphasizes the importance of trust, data governance, and collaboration in fostering AI innovation within the region.


Document: A European Approach to Artificial Intelligence - A Policy Perspective.pdf

- LangSmith: https://smith.langchain.com/public/c968bf7e-e22e-4eb1-a76a-b226eedc6c51/r

In [59]:
question = "What is the primary principle of the European AI approach?"
response = rag_chain.invoke(question)
print(response)

The primary principle of the European AI approach is to place people at the center of AI development, often referred to as "human-centric AI." This approach aims to support technological and industrial capacity, prepare for socio-economic changes, and ensure an appropriate ethical and legal framework. It emphasizes the need for AI to comply with the law, fulfill ethical principles, and be robust to achieve "trustworthy AI."


Ask a question unrelated to the document.

- LangSmith: https://smith.langchain.com/public/d8a49d52-3a63-4206-9166-58605bd990a6/r

In [60]:
question = "What is the obligation of the United States in AI?"
response = rag_chain.invoke(question)
print(response)

The obligation of the United States in AI primarily involves ensuring ethical standards, transparency, and accountability in AI development and deployment. This includes addressing concerns related to privacy, data governance, and the societal impacts of AI technologies. Additionally, the U.S. may need to engage in international cooperation to establish norms and regulations that promote responsible AI use.
