## Introduction to Semantic Chunking
Text chunking is an essential step in Retrieval-Augmented Generation (RAG), where large text bodies are divided into meaningful segments to improve retrieval accuracy.
Unlike fixed-length chunking, semantic chunking splits text based on the content similarity between sentences.

### Breakpoint Methods:
- **Percentile**: Finds the Xth percentile of all similarity differences and splits chunks where the drop is greater than this value.
- **Standard Deviation**: Splits where similarity drops more than X standard deviations below the mean.
- **Interquartile Range (IQR)**: Uses the interquartile distance (Q3 - Q1) to determine split points.

This notebook implements semantic chunking **using the percentile method** and evaluates its performance on a sample text.
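The three breakpoint rules above can be compared side by side on a handful of numbers. The following is a minimal sketch, not the notebook's implementation: it assumes the "similarity difference" is measured as `1 - similarity` (called `drops` here), and the function name `breakpoints_from_drops`, the parameter `x`, and the toy scores are all made up for illustration.

```python
import numpy as np

# Toy similarity scores between consecutive sentences (made-up values).
similarities = np.array([0.91, 0.88, 0.42, 0.90, 0.87, 0.35, 0.89, 0.93])

# Assumption: a "drop" is 1 - similarity, so larger means a stronger topic shift.
drops = 1.0 - similarities


def breakpoints_from_drops(drops, method="percentile", x=75.0):
    """Return indices i where a split is placed between sentence i and i + 1."""
    if method == "percentile":
        # Split where the drop exceeds the x-th percentile of all drops.
        cutoff = np.percentile(drops, x)
    elif method == "standard_deviation":
        # Split where the drop is more than x standard deviations above the mean drop.
        cutoff = drops.mean() + x * drops.std()
    elif method == "interquartile":
        # Split where the drop exceeds Q3 + x * IQR (the usual outlier rule with x = 1.5).
        q1, q3 = np.percentile(drops, [25, 75])
        cutoff = q3 + x * (q3 - q1)
    else:
        raise ValueError("method must be 'percentile', 'standard_deviation' or 'interquartile'")
    return [int(i) for i in np.flatnonzero(drops > cutoff)]


percentile_bps = breakpoints_from_drops(drops, "percentile", x=75)
std_bps = breakpoints_from_drops(drops, "standard_deviation", x=1.0)
iqr_bps = breakpoints_from_drops(drops, "interquartile", x=1.5)
print(percentile_bps, std_bps, iqr_bps)
```

On this toy input all three rules pick the same two gaps (the 0.42 and 0.35 similarities); on real text they usually disagree, which is why the choice of method and parameter matters.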



## Setting Up the Environment
We begin by importing necessary libraries.

In [3]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

In [2]:
!pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.0


## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [4]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page in mypdf:
        # Extract text from the current page and add spacing
        all_text += page.get_text("text") + " "

    # Return the extracted text, stripped of leading/trailing whitespace
    return all_text.strip()

# Define the path to the PDF file
pdf_path = "AI_Information.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Print the first 500 characters of the extracted text
print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


In [5]:
print(extracted_text)

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past few decades, advancements in computing power and data availability 
have significantly accelerated the development and deployment of AI. 
Historical Context 
The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. 
However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop 
in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving 
and symbolic methods. The 1980s saw a rise in exp

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [6]:
import os
from openai import OpenAI

# Explicitly set the environment variable
os.environ["OPENAI_API_KEY"] = "sk-xxxx"

# Initialize the client
client = OpenAI(
    base_url="http://4xxxx0/v1/"
)

## Creating Sentence-Level Embeddings
We split text into sentences and generate embeddings.

In [None]:
!pip install transformers sentence-transformers

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-en-icl")
model = AutoModel.from_pretrained("BAAI/bge-en-icl")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/69.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/640 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/22.2k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/10.0G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/9.89G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/8.56G [00:00<?, ?B/s]

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tqdm/notebook.py", line 250, in __iter__
    for obj in it:
  File "/usr/local/lib/python3.11/dist-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 451, in result
    self._condition.wait(timeout)
  File "/usr/lib/python3.11/threading.py", line 327,

TypeError: object of type 'NoneType' has no len()

In [None]:
def get_embedding(text, model="BAAI/bge-en-icl"):
    """
    Creates an embedding for the given text using OpenAI.

    Args:
    text (str): Input text.
    model (str): Embedding model name.

    Returns:
    np.ndarray: The embedding vector.
    """
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)

# Splitting text into sentences (basic split)
sentences = extracted_text.split(". ")

# Generate embeddings for each sentence
embeddings = [get_embedding(sentence) for sentence in sentences]

print(f"Generated {len(embeddings)} sentence embeddings.")

InternalServerError: Error code: 503 - {'error': {'message': '当前分组 default 下对于模型 BAAI/bge-en-icl 无可用渠道 (request id: 20250610183948959284034Xkm1WYB3)', 'type': 'new_api_error'}}

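I see you ran into a problem while generating text embeddings through the OpenAI API. The code requests a model named "BAAI/bge-en-icl", but that is an open-source model hosted on Hugging Face, not a model provided by OpenAI, so the call fails.

There are two ways to fix this:

### Option 1: Switch to an embedding model officially supported by OpenAI

Replace the model with an embedding model that OpenAI supports, such as text-embedding-ada-002.

```python
def get_embedding(text, model="text-embedding-ada-002"):
    """
    Creates an embedding for the given text using OpenAI.

    Args:
    text (str): Input text.
    model (str): Embedding model name.

    Returns:
    np.ndarray: The embedding vector.
    """
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)

# Split the text into sentences (basic split)
sentences = extracted_text.split(". ")

# Generate an embedding for each sentence
embeddings = [get_embedding(sentence) for sentence in sentences]

print(f"Generated {len(embeddings)} sentence embeddings.")
```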

### Option 2: Load BAAI/bge-en-icl with the Hugging Face transformers library

If you want to keep using BAAI/bge-en-icl, install the transformers library and load the model directly from Hugging Face.

```python
!pip install transformers sentence-transformers

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-en-icl")
model = AutoModel.from_pretrained("BAAI/bge-en-icl")

def get_embedding(text, model=model, tokenizer=tokenizer):
    """
    Creates an embedding for the given text using the BAAI/bge-en-icl model.

    Args:
    text (str): Input text.
    model: Embedding model.
    tokenizer: Tokenizer.

    Returns:
    np.ndarray: The embedding vector.
    """
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)

    # Run the model without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Pool the last token: bge-en-icl is a decoder-only model, so the first
    # position only sees the BOS token and would be identical for every input
    embedding = outputs.last_hidden_state[:, -1]

    # L2-normalize the embedding
    embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)

    return embedding.numpy().flatten()

# Split the text into sentences (basic split)
sentences = extracted_text.split(". ")

# Generate an embedding for each sentence
embeddings = [get_embedding(sentence) for sentence in sentences]

print(f"Generated {len(embeddings)} sentence embeddings.")
```

Option 1 is recommended because it is simpler and reuses your existing OpenAI API setup. Choose Option 2 if you need capabilities specific to the BAAI/bge-en-icl model.

In [None]:


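def get_embedding(text, model=model, tokenizer=tokenizer):
    """
    Creates an embedding for the given text using the BAAI/bge-en-icl model.

    Args:
    text (str): Input text.
    model: Embedding model.
    tokenizer: Tokenizer.

    Returns:
    np.ndarray: The embedding vector.
    """
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)

    # Run the model without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Pool the last token: bge-en-icl is a decoder-only model, so the first
    # position only sees the BOS token and would be identical for every input
    embedding = outputs.last_hidden_state[:, -1]

    # L2-normalize the embedding
    embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)

    return embedding.numpy().flatten()

# Split the text into sentences (basic split)
sentences = extracted_text.split(". ")

# Generate an embedding for each sentence
embeddings = [get_embedding(sentence) for sentence in sentences]

print(f"Generated {len(embeddings)} sentence embeddings.")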

In [7]:
def get_embedding(text, model="text-embedding-ada-002"):
    """
    Creates an embedding for the given text using OpenAI.

    Args:
    text (str): Input text.
    model (str): Embedding model name.

    Returns:
    np.ndarray: The embedding vector.
    """
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)

# Split the text into sentences (basic split)
sentences = extracted_text.split(". ")

# Generate an embedding for each sentence
embeddings = [get_embedding(sentence) for sentence in sentences]

print(f"Generated {len(embeddings)} sentence embeddings.")

Generated 257 sentence embeddings.
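Embedding 257 sentences one request at a time is slow. A possible alternative, sketched below under stated assumptions, is to send the sentences in batches. The helper `embed_in_batches`, its `embed_batch` callable, and the stand-in `fake_embed_batch` are hypothetical names introduced here; with the OpenAI client, `embed_batch` could wrap `client.embeddings.create(model=..., input=batch)`, which accepts a list of strings and returns one entry per input in `response.data`.

```python
from typing import Callable, List, Sequence

import numpy as np


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[np.ndarray]:
    """Embed texts in fixed-size batches, preserving the input order."""
    vectors: List[np.ndarray] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # embed_batch must return one vector per input text, in the same order.
        vectors.extend(np.asarray(v, dtype=float) for v in embed_batch(batch))
    return vectors


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real API call so the sketch runs offline (hypothetical)."""
    return [[float(len(text)), 1.0] for text in batch]


demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(len(demo_vectors), demo_vectors[2])
```

Passing the embedding call in as a function keeps the batching logic independent of the provider, so the same helper would work with either the API client or a local model.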


In [8]:
print(embeddings)

[array([-0.00475738, -0.01582001, -0.00472895, ..., -0.03386393,
       -0.01206718, -0.02161984]), array([ 0.00536695, -0.00395171,  0.02410443, ..., -0.01725438,
       -0.00191769, -0.02882192]), array([-0.0106483 , -0.02341698,  0.02013039, ..., -0.01671127,
       -0.01691006,  0.00356821]), array([ 0.00833709, -0.02698889,  0.00248174, ..., -0.00602015,
       -0.00862791, -0.01127769]), array([ 0.00523212, -0.02207471, -0.00211056, ...,  0.01335559,
       -0.0164852 , -0.01508138]), array([-0.0180023 , -0.01927195, -0.03097946, ..., -0.01447401,
       -0.00218346, -0.00511201]), array([-0.01893867, -0.0036094 ,  0.01427968, ..., -0.00925876,
       -0.02541388, -0.01905712]), array([-0.00675107, -0.01651996,  0.01324848, ..., -0.02968389,
       -0.01224038,  0.00241946]), array([-0.02398104, -0.00161015,  0.0234759 , ..., -0.01411744,
       -0.02512426,  0.00894636]), array([-0.00634062,  0.00997347, -0.00035746, ..., -0.01520668,
       -0.01207351, -0.01079728]), array([-0

In [9]:
print(sentences)

['Understanding Artificial Intelligence \nChapter 1: Introduction to Artificial Intelligence \nArtificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot \nto perform tasks commonly associated with intelligent beings', 'The term is frequently applied to \nthe project of developing systems endowed with the intellectual processes characteristic of \nhumans, such as the ability to reason, discover meaning, generalize, or learn from past \nexperience', 'Over the past few decades, advancements in computing power and data availability \nhave significantly accelerated the development and deployment of AI', '\nHistorical Context \nThe idea of artificial intelligence has existed for centuries, often depicted in myths and fiction', '\nHowever, the formal field of AI research began in the mid-20th century', 'The Dartmouth Workshop \nin 1956 is widely considered the birthplace of AI', 'Early AI research focused on problem-solving \nand symbolic methods', 

## Calculating Similarity Differences
We compute cosine similarity between consecutive sentences.

In [10]:
def cosine_similarity(vec1, vec2):
    """
    Computes cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): First vector.
    vec2 (np.ndarray): Second vector.

    Returns:
    float: Cosine similarity.
    """
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compute similarity between consecutive sentences
similarities = [cosine_similarity(embeddings[i], embeddings[i + 1]) for i in range(len(embeddings) - 1)]
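The consecutive similarities can also be computed in one vectorized pass instead of a Python loop. This is a minimal sketch with toy 2-D vectors standing in for real embeddings; the function name `consecutive_cosine_similarities` is introduced here for illustration.

```python
import numpy as np


def consecutive_cosine_similarities(embeddings):
    """Cosine similarity between each embedding and the one that follows it."""
    matrix = np.asarray(embeddings, dtype=float)
    # Normalize each row to unit length; the clip guards against zero vectors.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.clip(norms, 1e-12, None)
    # Row-wise dot product of row i with row i + 1.
    return np.sum(unit[:-1] * unit[1:], axis=1)


# Toy vectors: same direction, orthogonal, then opposite.
demo_sims = consecutive_cosine_similarities([[1, 0], [2, 0], [0, 3], [0, -1]])
print(demo_sims)
```

For `n` embeddings the result has `n - 1` entries, one per gap between neighboring sentences, which is the same shape as the `similarities` list built above.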

## Implementing Semantic Chunking
We implement three different methods for finding breakpoints.


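The following is a detailed walkthrough of the whole notebook, covering each section's purpose, implementation logic, and possible improvements in order:


### **I. Environment Setup and Dependency Installation**
#### **1. Importing Libraries and Installing Dependencies**
```python
!pip install pymupdf  # PyMuPDF provides the fitz module
import fitz  # PyMuPDF, used for PDF text extraction
import os
import numpy as np
import json
from openai import OpenAI
```
- **Issue**: The later code uses both the OpenAI API and a Hugging Face model, but OpenAI's `text-embedding-ada-002` and BAAI's `bge-en-icl` are not interchangeable, so one embedding approach must be chosen explicitly.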


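### **II. PDF Text Extraction**
#### **1. Purpose**
Extracts text from a PDF for the chunking steps that follow.
#### **2. Code Walkthrough**
```python
def extract_text_from_pdf(pdf_path):
    mypdf = fitz.open(pdf_path)  # Open the PDF
    all_text = ""
    for page in mypdf:
        all_text += page.get_text("text") + " "  # Extract text page by page and concatenate
    return all_text.strip()
```
- **Note**: PDF extraction can run into layout problems (for example, text inside tables or images is not recognized), so adapt the extraction approach to the structure of the actual PDF.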


### **III. Choosing an Embedding Model (the Key Decision Point)**
#### **Option 1: OpenAI's text-embedding-ada-002**
```python
from openai import OpenAI
client = OpenAI(base_url="http://47.84.70.98:9000/v1/", api_key="your-key")

def get_embedding(text, model="text-embedding-ada-002"):
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)
```
- **Advantages**: Simple to call, no local model loading, well suited to online services.
- **Limitations**: Requires an API key and network access; `bge-en-icl` is not an OpenAI model, so requesting it through this client fails.

#### **Option 2: BAAI/bge-en-icl from Hugging Face**
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-en-icl")
model = AutoModel.from_pretrained("BAAI/bge-en-icl")

def get_embedding(text, model=model, tokenizer=tokenizer):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Last-token pooling: in this decoder-only model the first position only sees BOS
    embedding = outputs.last_hidden_state[:, -1]
    embedding = torch.nn.functional.normalize(embedding, dim=1)  # Unit-length vectors, so a dot product equals cosine similarity
    return embedding.numpy().flatten()
```
- **Advantages**: Open weights and free to use, suitable for local deployment, strong semantic representations for English text.
- **Note**: Requires PyTorch and Transformers, and loading is memory-hungry: the three safetensors shards of `BAAI/bge-en-icl` total roughly 28 GB.


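### **IV. Core Semantic Chunking Logic**
#### **1. Sentence Splitting and Embedding Generation**
```python
sentences = extracted_text.split(". ")  # Naive split on ". "; can be inaccurate (e.g., abbreviations)
embeddings = [get_embedding(sentence) for sentence in sentences]
```
- **Improvement**: Use a more accurate sentence splitter (such as NLTK's `sent_tokenize`) to handle complex punctuation.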
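As a dependency-free middle ground between `split(". ")` and a full tokenizer, a regular expression can split on sentence-ending punctuation followed by whitespace and an uppercase letter. This is a rough sketch and not NLTK's `sent_tokenize`; the name `split_sentences` and the pattern are assumptions made for illustration.

```python
import re

# Split after ., ! or ? when followed by whitespace and an uppercase letter,
# an opening quote, or an opening parenthesis.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'(])")


def split_sentences(text):
    """Split text into sentences, keeping the terminal punctuation."""
    # Collapse newlines and repeated spaces left over from PDF extraction.
    normalized = re.sub(r"\s+", " ", text).strip()
    if not normalized:
        return []
    return [part for part in _SENTENCE_END.split(normalized) if part]


demo_sentences = split_sentences(
    "AI is broad.\nIt began in 1956! Was it fiction? Version 3.5 shipped."
)
print(demo_sentences)
```

Unlike the naive split, this keeps each sentence's closing punctuation and does not break inside decimals such as `3.5`. It still splits after abbreviations like "Dr." when a capitalized word follows, which is the kind of case a trained tokenizer handles better.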

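#### **2. Computing Cosine Similarity Between Sentences**
```python
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarities = [cosine_similarity(embeddings[i], embeddings[i+1]) for i in range(len(embeddings)-1)]
```
- **Logic**: The lower the similarity between adjacent sentences, the more likely that position is a chunk boundary.

#### **3. Breakpoint Detection Methods**
```python
def compute_breakpoints(similarities, method="percentile", threshold=90):
    if method == "percentile":
        threshold_value = np.percentile(similarities, threshold)  # threshold-th percentile of the similarity scores
    elif method == "standard_deviation":
        threshold_value = np.mean(similarities) - threshold * np.std(similarities)  # X standard deviations below the mean
    elif method == "interquartile":
        q1, q3 = np.percentile(similarities, [25, 75])
        threshold_value = q1 - 1.5 * (q3 - q1)  # IQR outlier rule (as in a box plot)
    return [i for i, sim in enumerate(similarities) if sim < threshold_value]
```
- **Parameter meaning**: For the percentile method, `threshold` is the percentile of the similarity scores used as the cut-off. Because a gap becomes a breakpoint when its similarity is below that cut-off, `threshold=90` marks roughly 90% of the gaps as breakpoints, while `threshold=10` keeps only the 10% least similar gaps. For the standard-deviation method it is a multiplier (e.g., `threshold=2` splits where similarity is more than 2 standard deviations below the mean).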
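The effect of the percentile parameter is easy to check numerically. The sketch below uses 256 evenly spaced synthetic similarity scores (one per gap between 257 sentences, matching the sentence count reported earlier in the notebook); `count_breakpoints` is a hypothetical helper that applies the same `sim < cutoff` rule as `compute_breakpoints`.

```python
import numpy as np

# Synthetic, strictly increasing similarity scores: one per gap between 257 sentences.
similarities = np.linspace(0.60, 0.95, 256)


def count_breakpoints(similarities, percentile):
    """Count gaps whose similarity falls below the given percentile cut-off."""
    cutoff = np.percentile(similarities, percentile)
    return int(np.sum(similarities < cutoff))


n_at_90 = count_breakpoints(similarities, 90)
n_at_10 = count_breakpoints(similarities, 10)
print(n_at_90, n_at_10)  # 230 26
```

With the 90th percentile, 230 of the 256 gaps become breakpoints, which yields 231 chunks, most of them a single sentence. With the 10th percentile only 26 gaps qualify, giving 27 larger chunks.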

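#### **4. Building the Chunks**
```python
def split_into_chunks(sentences, breakpoints):
    chunks = []
    start = 0
    for bp in breakpoints:
        chunks.append(". ".join(sentences[start:bp+1]) + ".")  # Join the sentences into one chunk
        start = bp + 1
    chunks.append(". ".join(sentences[start:]))  # Handle the remaining sentences
    return chunks
```
- **Logic**: The breakpoint list partitions the sentences into consecutive chunks; each chunk contains the sentences from `start` through the breakpoint index.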
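A tiny worked example makes the index arithmetic concrete. The function below is copied from the notebook; the four-sentence text and the breakpoint list are made up for illustration.

```python
def split_into_chunks(sentences, breakpoints):
    """Group sentences into chunks, ending a chunk after each breakpoint index."""
    chunks = []
    start = 0
    for bp in breakpoints:
        # The separator removed by split(". ") is restored, including the final period.
        chunks.append(". ".join(sentences[start:bp + 1]) + ".")
        start = bp + 1
    chunks.append(". ".join(sentences[start:]))
    return chunks


text = "Cats purr. Dogs bark. Stocks fell. Bonds rose."
sentences = text.split(". ")  # the last element keeps its trailing period
demo_chunks = split_into_chunks(sentences, breakpoints=[1])
print(demo_chunks)
```

A breakpoint at index 1 means the gap after the second sentence, so the first chunk holds sentences 0 and 1 and the second chunk holds the rest.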


### **V. Semantic Search and Response Generation**
#### **1. Retrieval Based on Cosine Similarity**
```python
def semantic_search(query, text_chunks, chunk_embeddings, k=5):
    query_embedding = get_embedding(query)
    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]
    top_indices = np.argsort(similarities)[-k:][::-1]  # Indices of the k most similar chunks
    return [text_chunks[i] for i in top_indices]
```
- **Purpose**: Retrieves the chunks most relevant to the query so they can serve as context in the generation stage of RAG.
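The ranking step can be tried without any API call by substituting small hand-written vectors for real embeddings. This is a sketch under that assumption; `top_k_chunks`, the chunk labels, and the vectors are hypothetical.

```python
import numpy as np


def top_k_chunks(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks most similar to the query, with their cosine scores."""
    matrix = np.asarray(chunk_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Cosine similarity of the query against every chunk vector at once.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in order]


chunks = ["about cats", "about dogs", "about bonds"]
chunk_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
results = top_k_chunks([1.0, 0.1], chunk_vecs, chunks, k=2)
print(results)
```

Returning the score next to each chunk makes it easier to spot weak matches, for example when even the best chunk has low similarity to the query.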

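#### **2. Calling the LLM to Generate an Answer**
```python
system_prompt = "Answer strictly based on the given context..."
def generate_response(system_prompt, user_message, model="meta-llama/Llama-3.2-3B-Instruct"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response
```
- **Note**: `meta-llama/Llama-3.2-3B-Instruct` is an open-weights model, so `client.chat.completions.create` only works if the OpenAI-compatible endpoint behind `client` actually serves it. Otherwise, adjust the call to how the model is deployed (for example, local inference with the Transformers library).


### **VI. Evaluation and Possible Improvements**
#### **1. Answer Evaluation Logic**
```python
evaluate_system_prompt = "Assess the accuracy of the AI answer against the true answer..."
evaluation_prompt = f"User question: {query}\nAI answer: {ai_response}\nTrue answer: {data[0]['ideal_answer']}..."
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)
```
- **Limitation**: The evaluation relies on an LLM's own judgment of accuracy, which can be subjective. Metrics such as ROUGE or BLEU, or human annotation, are more dependable alternatives.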

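#### **2. Possible Improvements**
1. **Sentence splitting**: Use the sentence segmenters in `spaCy` or `nltk` to handle complex text.
2. **Chunking strategy**: Add a length limit (for example, no more than 512 characters per chunk) to avoid chunks that are too long or too short.
3. **Embedding model**: For Chinese text, use a Chinese-optimized model such as `BAAI/bge-large-zh` to improve semantic matching.
4. **Breakpoint parameter tuning**: Choose the threshold by cross-validation (for example, a grid search over percentiles).
5. **Error handling**: Add tolerance for empty text and single-sentence text.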
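The length-limit and empty-input items in the list above could be combined in a small post-processing step. The sketch below greedily packs the sentences of an oversized chunk into pieces of at most `max_chars` characters; the function name `cap_chunk_length` and the default limit are assumptions, and a single sentence longer than the limit is kept whole rather than cut mid-sentence.

```python
def cap_chunk_length(sentences, max_chars=512):
    """Greedily pack sentences into chunks of at most max_chars characters."""
    chunks = []
    current = ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # tolerate empty strings from naive splitting
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the cap, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


capped = cap_chunk_length(["aaaa.", "bbbb.", "", "cccc."], max_chars=11)
print(capped)
print(cap_chunk_length([]))
```

Because splits only happen at sentence boundaries, the semantic grouping found earlier is preserved as far as the limit allows, and an empty input simply returns an empty list.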


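### **VII. Common Errors and Solutions**
1. **OpenAI API key problems**:
   - Error: `OpenAIError: api_key not set`
   - Fix: Make sure `api_key` is set correctly through the environment or in code, e.g. `os.environ["OPENAI_API_KEY"] = "your-key"`.

2. **Model not found**:
   - Error: `Unknown model: BAAI/bge-en-icl` (when called through the OpenAI interface)
   - Fix: Keep OpenAI models and Hugging Face models clearly separate and call each through its own library.

3. **Out of memory**:
   - Symptom: The process crashes while loading a large model.
   - Fix: Use a smaller model (such as `BAAI/bge-small-en`) or enable quantization (`load_in_4bit=True`).


### **Summary**
This notebook implements the core semantic chunking module of a RAG pipeline: it computes the semantic similarity between sentences and uses it to divide the text into chunks dynamically, which follows the content's logic more closely than fixed-length chunking. Key points:
- **Embedding model choice**: Pick OpenAI or an open-source model depending on cost and network constraints.
- **Chunking algorithm**: The percentile, standard-deviation, and related methods use a statistical threshold to detect semantic breaks automatically.
- **End-to-end flow**: PDF extraction, chunking, retrieval, and answer generation together show the basic structure of RAG.

In practice, tune further for the characteristics of the data (language, domain) and the available compute, so that chunk granularity and retrieval accuracy stay in balance.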

In [11]:
def compute_breakpoints(similarities, method="percentile", threshold=90):
    """
    Computes chunking breakpoints based on similarity drops.

    Args:
    similarities (List[float]): List of similarity scores between sentences.
    method (str): 'percentile', 'standard_deviation', or 'interquartile'.
    threshold (float): Threshold value (percentile for 'percentile', std devs for 'standard_deviation').

    Returns:
    List[int]: Indices where chunk splits should occur.
    """
    # Determine the threshold value based on the selected method
    if method == "percentile":
        # Calculate the Xth percentile of the similarity scores
        threshold_value = np.percentile(similarities, threshold)
    elif method == "standard_deviation":
        # Calculate the mean and standard deviation of the similarity scores
        mean = np.mean(similarities)
        std_dev = np.std(similarities)
        # Set the threshold value to mean minus X standard deviations
        threshold_value = mean - (threshold * std_dev)
    elif method == "interquartile":
        # Calculate the first and third quartiles (Q1 and Q3)
        q1, q3 = np.percentile(similarities, [25, 75])
        # Set the threshold value using the IQR rule for outliers
        threshold_value = q1 - 1.5 * (q3 - q1)
    else:
        # Raise an error if an invalid method is provided
        raise ValueError("Invalid method. Choose 'percentile', 'standard_deviation', or 'interquartile'.")

    # Identify indices where similarity drops below the threshold value
    return [i for i, sim in enumerate(similarities) if sim < threshold_value]

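# Compute breakpoints using the percentile method with a threshold of 90.
# Note: the cut-off is the 90th percentile of the similarity scores, so about 90% of
# the gaps fall below it and become breakpoints; a lower value (e.g. 10) splits less often.
breakpoints = compute_breakpoints(similarities, method="percentile", threshold=90)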

## Splitting Text into Semantic Chunks
We split the text based on computed breakpoints.

In [12]:
def split_into_chunks(sentences, breakpoints):
    """
    Splits sentences into semantic chunks.

    Args:
    sentences (List[str]): List of sentences.
    breakpoints (List[int]): Indices where chunking should occur.

    Returns:
    List[str]: List of text chunks.
    """
    chunks = []  # Initialize an empty list to store the chunks
    start = 0  # Initialize the start index

    # Iterate through each breakpoint to create chunks
    for bp in breakpoints:
        # Append the chunk of sentences from start to the current breakpoint
        chunks.append(". ".join(sentences[start:bp + 1]) + ".")
        start = bp + 1  # Update the start index to the next sentence after the breakpoint

    # Append the remaining sentences as the last chunk
    chunks.append(". ".join(sentences[start:]))
    return chunks  # Return the list of chunks

# Create chunks using the split_into_chunks function
text_chunks = split_into_chunks(sentences, breakpoints)

# Print the number of chunks created
print(f"Number of semantic chunks: {len(text_chunks)}")

# Print the first chunk to verify the result
print("\nFirst text chunk:")
print(text_chunks[0])


Number of semantic chunks: 231

First text chunk:
Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings.


In [14]:
print(text_chunks)

['Understanding Artificial Intelligence \nChapter 1: Introduction to Artificial Intelligence \nArtificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot \nto perform tasks commonly associated with intelligent beings.', 'The term is frequently applied to \nthe project of developing systems endowed with the intellectual processes characteristic of \nhumans, such as the ability to reason, discover meaning, generalize, or learn from past \nexperience.', 'Over the past few decades, advancements in computing power and data availability \nhave significantly accelerated the development and deployment of AI.', '\nHistorical Context \nThe idea of artificial intelligence has existed for centuries, often depicted in myths and fiction.', '\nHowever, the formal field of AI research began in the mid-20th century.', 'The Dartmouth Workshop \nin 1956 is widely considered the birthplace of AI.', 'Early AI research focused on problem-solving \nand symbolic meth

## Creating Embeddings for Semantic Chunks
We create embeddings for each chunk for later retrieval.

In [13]:
def create_embeddings(text_chunks):
    """
    Creates embeddings for each text chunk.

    Args:
    text_chunks (List[str]): List of text chunks.

    Returns:
    List[np.ndarray]: List of embedding vectors.
    """
    # Generate embeddings for each text chunk using the get_embedding function
    return [get_embedding(chunk) for chunk in text_chunks]

# Create chunk embeddings using the create_embeddings function
chunk_embeddings = create_embeddings(text_chunks)

In [15]:
print(chunk_embeddings)

[array([-0.00371943, -0.01289066, -0.00390493, ..., -0.03116395,
       -0.01306672, -0.02264983]), array([ 0.00560299, -0.00215363,  0.02486771, ..., -0.01725488,
        0.00036897, -0.02846087]), array([-0.0141247 , -0.02703131,  0.02554109, ..., -0.01500588,
       -0.02196456,  0.007017  ]), array([ 0.00534353, -0.02718944,  0.00507395, ..., -0.0026493 ,
       -0.01086679, -0.00620684]), array([ 0.00203084, -0.02890852, -0.00070833, ...,  0.01431442,
       -0.01747987, -0.00914674]), array([-0.01971043, -0.02060457, -0.02461503, ..., -0.01554218,
       -0.00428988, -0.00288293]), array([-0.02417807, -0.00379669,  0.01462235, ..., -0.00676645,
       -0.03037353, -0.01439921]), array([-0.01099151, -0.01921565,  0.02029401, ..., -0.02821933,
       -0.01112793,  0.00394317]), array([-0.02620415, -0.00278679,  0.02576228, ..., -0.01462184,
       -0.02521329,  0.01031696]), array([-0.01274966,  0.00713901,  0.00230258, ..., -0.01722746,
       -0.01713361, -0.00905615]), array([-0

## Performing Semantic Search
We implement cosine similarity to retrieve the most relevant chunks.

In [16]:
def semantic_search(query, text_chunks, chunk_embeddings, k=5):
    """
    Finds the most relevant text chunks for a query.

    Args:
    query (str): Search query.
    text_chunks (List[str]): List of text chunks.
    chunk_embeddings (List[np.ndarray]): List of chunk embeddings.
    k (int): Number of top results to return.

    Returns:
    List[str]: Top-k relevant chunks.
    """
    # Generate an embedding for the query
    query_embedding = get_embedding(query)

    # Calculate cosine similarity between the query embedding and each chunk embedding
    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]

    # Get the indices of the top-k most similar chunks
    top_indices = np.argsort(similarities)[-k:][::-1]

    # Return the top-k most relevant text chunks
    return [text_chunks[i] for i in top_indices]

In [18]:
# Load the validation data from a JSON file
with open('val.json') as f:
    data = json.load(f)

# Extract the first query from the validation data
query = data[0]['question']

# Get top 2 relevant chunks
top_chunks = semantic_search(query, text_chunks, chunk_embeddings, k=2)

# Print the query
print(f"Query: {query}")

# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i+1}:\n{chunk}\n{'='*40}")

Query: What is 'Explainable AI' and why is it considered important?
Context 1:

Transparency and Explainability 
Transparency and explainability are essential for building trust in AI systems.
Context 2:

Explainable AI (XAI) 
Explainable AI (XAI) aims to make AI systems more transparent and understandable. Research in 
XAI focuses on developing methods for explaining AI decisions, enhancing trust, and improving 
accountability.


## Generating a Response Based on Retrieved Chunks

In [19]:
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(system_prompt, user_message, model="meta-llama/Llama-3.2-3B-Instruct"):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The model used to generate the response. Default is "meta-llama/Llama-3.2-3B-Instruct".

    Returns:
    ChatCompletion: The response object returned by the chat completions API.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)

InternalServerError: Error code: 503 - {'error': {'message': '当前分组 default 下对于模型 meta-llama/Llama-3.2-3B-Instruct 无可用渠道 (request id: 202506102106583647058870uTZ0wDX)', 'type': 'new_api_error'}}


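This error indicates that the service has no available channel for the model you specified, `meta-llama/Llama-3.2-3B-Instruct`. The likely cause is that the OpenAI-compatible service you are using (`http://47.84.70.98:9000/v1/`) has not deployed that model, or that the model name does not match what the service expects. The reasoning and steps to resolve it are as follows:


### **Error Cause Analysis**
1. **Model not deployed**:  
   The API service behind `base_url` may support only a set of predefined models (such as OpenAI's official models or selected open-source models), and `meta-llama/Llama-3.2-3B-Instruct` is not deployed there.
2. **Model name mismatch**:  
   Check the spelling of the model name (hyphens, letter case). `meta-llama/Llama-3.2-3B-Instruct` is a valid Hugging Face repository ID, but an OpenAI-compatible gateway may register the model under a different name or not at all.
3. **Service permissions**:  
   The service may restrict model access by group (the error message says the `default` group has no available channel), so confirm that you are allowed to use this model.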


### **Solutions**
#### **Option 1: Switch to a model the service supports**
Ask the service provider for the list of supported models, or try a model officially supported by OpenAI (such as `gpt-3.5-turbo` or `gpt-4`).  
**Code change**:
```python
def generate_response(system_prompt, user_message, model="gpt-3.5-turbo"):  # Replace with an available model
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response
```

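#### **Option 2: Deploy a Llama model locally and call it directly**
To use a `meta-llama` model, deploy it locally or on a private server and call it directly through the Transformers library instead of the OpenAI API.  
**Steps**:
1. **Install dependencies**:  
   ```bash
   pip install transformers torch accelerate bitsandbytes
   ```
2. **Load the model and generate an answer**:  
   ```python
   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

   # Load the model and tokenizer (Llama-2-7b-chat as an example; adjust to the model you actually use)
   model_name = "meta-llama/Llama-2-7b-chat-hf"
   tokenizer = AutoTokenizer.from_pretrained(model_name)
   model = AutoModelForCausalLM.from_pretrained(
       model_name,
       quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit quantization to reduce memory use
       device_map="auto"
   )

   def generate_response(system_prompt, user_message):
       # Build the prompt (Llama-2 chat expects this format; see the official examples)
       prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
       inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

       with torch.no_grad():
           outputs = model.generate(
               **inputs,
               max_new_tokens=200,
               do_sample=False  # Greedy decoding, matching temperature=0 in the API version
           )

       # Decode only the newly generated tokens, not the echoed prompt
       new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
       return tokenizer.decode(new_tokens, skip_special_tokens=True)
   ```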

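#### **Option 3: Confirm the model name and service compatibility**
- If the service does support a Llama 3 model, confirm the exact name it is registered under, since gateways often use their own aliases.
- Contact the service provider to confirm which models are deployed, or try another open-source model (such as `mistral-7b-instruct` or `falcon-7b`).


### **Key Considerations**
1. **Model permissions and groups**:  
   The `default` group mentioned in the error message may mean you need to switch groups or request access to the model; contact the platform administrator.
2. **Quantization and hardware**:  
   Llama models are compute-hungry. A GPU (such as an NVIDIA A100) with quantization enabled (`load_in_4bit`/`load_in_8bit`) is recommended to reduce memory use.
3. **Prompt format**:  
   Some open-source models (such as Llama-2) require a specific prompt format (for example, wrapping the content in `[INST]` and `[/INST]`); adjust it according to the official documentation.


### **Summary**
The root cause of this error is that the target model is not available on the specified service. To use an open-source LLM, call it through a local deployment or a private API service; if you depend on a third-party API, make sure the model name is correct and the service supports it. Depending on your use case, Option 1 (switching to a compatible model) is the quickest fix, while Option 2 (local deployment) suits scenarios that need tight control over the model.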

In [20]:
def generate_response(system_prompt, user_message, model="gpt-3.5-turbo"):  # Replaced with a model the service supports
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

## Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.

In [22]:
# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)

In [23]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.choices[0].message.content)

Score: 0.5
The AI response provides a general understanding of Explainable AI (XAI) and its importance, but it lacks some key details present in the true response, such as the focus on providing insights into AI decision-making and ensuring fairness in AI systems.
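To aggregate scores over many validation questions, the numeric score has to be pulled out of the evaluator's free-text reply. The sketch below assumes the reply contains a phrase like `Score: 0.5`, as in the output above; `parse_score` is a hypothetical helper, and replies in any other format return `None` rather than a guessed value.

```python
import re

ALLOWED_SCORES = (0.0, 0.5, 1.0)


def parse_score(evaluation_text):
    """Extract a 0 / 0.5 / 1 score from text such as 'Score: 0.5', else None."""
    match = re.search(r"score\s*[:=]?\s*(\d+(?:\.\d+)?)", evaluation_text, flags=re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(1))
    # Reject numbers outside the three values the evaluation prompt allows.
    return value if value in ALLOWED_SCORES else None


parsed = parse_score("Score: 0.5\nThe AI response provides a general understanding...")
print(parsed)
```

Returning `None` for unparseable replies lets the caller count and inspect them separately instead of silently treating them as zero.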


In [24]:
print(evaluation_response)

ChatCompletion(id='chatcmpl-Bgt2HGhtNkAgaoE0ahtIluQA3JLhf', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Score: 0.5\nThe AI response provides a general understanding of Explainable AI (XAI) and its importance, but it lacks some key details present in the true response, such as the focus on providing insights into AI decision-making and ensuring fairness in AI systems.', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}})], created=1749561033, model='gpt-3.5-turbo-0125', object='chat.completion', service_tier=None, system_fingerprint='fp_0165350fbb', usage=CompletionUsage(completion_tokens=56, prompt_tokens=277, total_tokens=333, completion_tokens_details=N