## Evaluating Chunk Sizes in Simple RAG

Choosing the right chunk size is crucial for improving retrieval accuracy in a Retrieval-Augmented Generation (RAG) pipeline. The goal is to balance retrieval performance with response quality.

This section evaluates different chunk sizes by:

1. Extracting text from a PDF.
2. Splitting text into chunks of varying sizes.
3. Creating embeddings for each chunk.
4. Retrieving relevant chunks for a query.
5. Generating a response using retrieved chunks.
6. Evaluating faithfulness and relevancy.
7. Comparing results for different chunk sizes.


## Setting Up the Environment
We begin by importing necessary libraries.

In [None]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

In [None]:
pip install pymuPdf

Collecting pymuPdf
  Downloading pymupdf-1.26.1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.1-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymuPdf
Successfully installed pymuPdf-1.26.1


## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [None]:
import os
from openai import OpenAI

#os.environ["OPxxxxY"] = "sxxxxxxx"

client = OpenAI(
    base_url="hxxxxx"
)

## Extracting Text from the PDF
First, we will extract text from the `AI_Information.pdf` file.

In [None]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page in mypdf:
        # Extract text from the current page and add spacing
        all_text += page.get_text("text") + " "

    # Return the extracted text, stripped of leading/trailing whitespace
    return all_text.strip()

# Define the path to the PDF file
pdf_path = "AI_Information.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Print the first 500 characters of the extracted text
print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f



In [None]:
print(extracted_text)

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past few decades, advancements in computing power and data availability 
have significantly accelerated the development and deployment of AI. 
Historical Context 
The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. 
However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop 
in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving 
and symbolic methods. The 1980s saw a rise in exp

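PyMuPDF can extract Chinese text out of the box, with two caveats:

- If the PDF is a scan (page images rather than text), run it through an OCR tool such as Tesseract first.
- Complex layouts (multiple columns, text mixed with figures) can scramble the order of the extracted text.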

## Chunking the Extracted Text
To improve retrieval, we split the extracted text into overlapping chunks of different sizes.

In [None]:
def chunk_text(text, n, overlap):
    """
    Splits text into overlapping chunks.

    Args:
    text (str): The text to be chunked.
    n (int): Number of characters per chunk.
    overlap (int): Overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
    chunks = []  # Initialize an empty list to store the chunks
    for i in range(0, len(text), n - overlap):
        # Append a chunk of text from the current index to the index + chunk size
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks

# Define different chunk sizes to evaluate
chunk_sizes = [128, 256, 512]

# Create a dictionary to store text chunks for each chunk size
text_chunks_dict = {size: chunk_text(extracted_text, size, size // 5) for size in chunk_sizes}

# Print the number of chunks created for each chunk size
for size, chunks in text_chunks_dict.items():
    print(f"Chunk Size: {size}, Number of Chunks: {len(chunks)}")

Chunk Size: 128, Number of Chunks: 326
Chunk Size: 256, Number of Chunks: 164
Chunk Size: 512, Number of Chunks: 82
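
The counts above follow from the stride `n - overlap`: each chunk starts `n - overlap` characters after the previous one, so a text of length `L` yields `ceil(L / (n - overlap))` chunks. The sketch below checks that relationship on synthetic text. `chunk_text` is restated so the block is self-contained (with an added guard against `overlap >= n`, which the original does not have), and `expected_chunk_count` is a hypothetical helper name.

```python
import math

def chunk_text(text, n, overlap):
    """Sliding-window split: windows of n characters, stride n - overlap."""
    if overlap >= n:
        # Added guard (assumption): a non-positive stride would break range()
        raise ValueError("overlap must be smaller than the chunk size n")
    step = n - overlap
    return [text[i:i + n] for i in range(0, len(text), step)]

def expected_chunk_count(text_length, n, overlap):
    """Hypothetical helper: number of start positions in range(0, L, n - overlap)."""
    return math.ceil(text_length / (n - overlap))

sample = "x" * 1000  # synthetic text, not the PDF content
for size in (128, 256, 512):
    overlap = size // 5
    chunks = chunk_text(sample, size, overlap)
    assert len(chunks) == expected_chunk_count(len(sample), size, overlap)
    print(f"size={size} overlap={overlap} stride={size - overlap} chunks={len(chunks)}")
```

Doubling the chunk size roughly halves the number of chunks, which matches the 326 / 164 / 82 pattern in the output above. The final chunk is usually shorter than `n`.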


In [None]:
print(text_chunks_dict)

{128: ['Understanding Artificial Intelligence \nChapter 1: Introduction to Artificial Intelligence \nArtificial intelligence (AI) refers t', 'ntelligence (AI) refers to the ability of a digital computer or computer-controlled robot \nto perform tasks commonly associated ', 'asks commonly associated with intelligent beings. The term is frequently applied to \nthe project of developing systems endowed w', 'eloping systems endowed with the intellectual processes characteristic of \nhumans, such as the ability to reason, discover meani', 'to reason, discover meaning, generalize, or learn from past \nexperience. Over the past few decades, advancements in computing po', 'ancements in computing power and data availability \nhave significantly accelerated the development and deployment of AI. \nHistor', 'deployment of AI. \nHistorical Context \nThe idea of artificial intelligence has existed for centuries, often depicted in myths an', 'ften depicted in myths and fiction. \nHowever, the formal 

## Creating Embeddings for Text Chunks
Embeddings convert text into numerical representations for similarity search.

In [None]:
from tqdm import tqdm

def create_embeddings(texts, model="BAAI/bge-en-icl"):
    """
    Generates embeddings for a list of texts.

    Args:
    texts (List[str]): List of input texts.
    model (str): Embedding model.

    Returns:
    List[np.ndarray]: List of numerical embeddings.
    """
    # Create embeddings using the specified model
    response = client.embeddings.create(model=model, input=texts)
    # Convert the response to a list of numpy arrays and return
    return [np.array(embedding.embedding) for embedding in response.data]

# Generate embeddings for each chunk size
# Iterate over each chunk size and its corresponding chunks in the text_chunks_dict
chunk_embeddings_dict = {size: create_embeddings(chunks) for size, chunks in tqdm(text_chunks_dict.items(), desc="Generating Embeddings")}

Generating Embeddings:   0%|          | 0/3 [00:00<?, ?it/s]

InternalServerError: Error code: 503 - {'error': {'message': 'No available channel for model BAAI/bge-en-icl under the current group default (request id: 20250612184410456769460Qphom7Zw)', 'type': 'new_api_error'}}

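### Code Walkthrough: Generating and Handling Text Embeddings

#### 1. Core function: `create_embeddings`

```python
def create_embeddings(texts, model="BAAI/bge-en-icl"):
    """
    Generates semantic embedding vectors for a list of texts (used for RAG retrieval).

    Args:
        texts (List[str]): Input texts (for example, the text chunks).
        model (str): Embedding model name (defaults to BAAI/bge-en-icl).

    Returns:
        List[np.ndarray]: One embedding per input text, each as a NumPy array.
    """
    # Call the embeddings endpoint of the OpenAI-compatible client
    response = client.embeddings.create(model=model, input=texts)
    # Convert each returned embedding to a NumPy array
    return [np.array(embedding.embedding) for embedding in response.data]
```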

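#### 2. How the embeddings are generated

- **Model choice**  
  `BAAI/bge-en-icl` is an embedding model from the Beijing Academy of Artificial Intelligence (BAAI):
  - It targets English text (the `en` in the name), so it is not the natural choice for Chinese retrieval.
  - It is trained with a contrastive objective, so semantically similar texts land close together in vector space.
  - The output dimension depends on the model; check the model card rather than assuming 768 or 1024.

- **API call**  
  ```python
  response = client.embeddings.create(model=model, input=texts)
  ```
  - `model` selects the embedding model.
  - `input` accepts a list of texts. The maximum list length depends on the provider (OpenAI documents a limit of 2048 inputs per request).
  - The response has a `data` field; each element carries an `embedding` list.

- **Format conversion**  
  ```python
  [np.array(embedding.embedding) for embedding in response.data]
  ```
  - Converts the Python lists returned by the API into NumPy arrays.
  - This simplifies later vector operations such as cosine similarity.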


#### 3. Main program: embedding the chunks for every chunk size

```python
# Show a progress bar with tqdm
chunk_embeddings_dict = {
    size: create_embeddings(chunks)
    for size, chunks in tqdm(text_chunks_dict.items(), desc="Generating Embeddings")
}
```

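#### 4. Key points of the execution flow

- **Batching rather than parallelism**  
  The code does no explicit parallelism, but each call sends the whole list of chunks in a single request:
  - One call can embed many texts, up to the provider's per-request limit (2048 inputs for OpenAI).
  - For large numbers of chunks, add explicit batching.

- **Progress display**  
  `tqdm(text_chunks_dict.items(), desc="Generating Embeddings")`:
  - Tracks progress over the chunk sizes (three iterations here), not over individual chunks.
  - Example output: `Generating Embeddings: 100%|██████████| 3/3 [00:10<00:00,  3.21s/it]`

- **Result structure**  
  `chunk_embeddings_dict` is a dictionary that maps each chunk size to its list of embedding vectors:
  ```python
  {
      128: [np.array([...]), np.array([...]), ...],  # embeddings for 128-character chunks
      256: [np.array([...]), np.array([...]), ...],  # embeddings for 256-character chunks
      512: [np.array([...]), np.array([...]), ...]   # embeddings for 512-character chunks
  }
  ```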


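#### 5. How the embeddings are used

- **Core RAG retrieval steps**  
  1. Chunk the text → 2. Generate embeddings → 3. Build a vector store → 4. Compute vector similarity at query time
  - Embedding quality directly affects retrieval relevance.

- **Vector space properties**  
  - Semantically similar texts are closer together in vector space.
  - Cosine similarity is a common measure of how related two texts are:
    ```python
    def cosine_similarity(vec1, vec2):
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    ```

- **Comparison of common embedding models**  
  | Model                  | Dimensions                    | Languages             | Notes                                  |
  |------------------------|-------------------------------|-----------------------|----------------------------------------|
  | BAAI/bge-en-icl        | See model card                | English               | Retrieval-oriented                     |
  | text-embedding-ada-002 | 1536                          | Multilingual          | OpenAI model, general purpose          |
  | Sentence-BERT          | Varies by checkpoint          | Depends on checkpoint | Optimized for sentence-level semantics |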


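#### 6. Potential problems and mitigations

- **API rate limits**  
  - Problem: embedding APIs enforce rate limits; the exact quota depends on the provider and account.
  - Mitigation: add retries with backoff and error handling.
    ```python
    from tenacity import retry, stop_after_attempt, wait_exponential

    # Retry with exponential backoff (waits of 2 to 10 s), giving up after 5 attempts
    @retry(wait=wait_exponential(multiplier=1, min=2, max=10), stop=stop_after_attempt(5))
    def safe_create_embeddings(texts, model):
        return create_embeddings(texts, model)
    ```

- **Large batches**  
  - Problem: sending too many texts in one request can cause the API call to time out.
  - Mitigation: call the API in batches (for example, 500 texts per batch).
    ```python
    def batch_create_embeddings(texts, model, batch_size=500):
        embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i+batch_size]
            batch_embeddings = create_embeddings(batch, model)
            embeddings.extend(batch_embeddings)
        return embeddings
    ```

- **Local model deployment**  
  If API access is limited, an embedding model can be run locally:
  ```python
  from sentence_transformers import SentenceTransformer

  # Load the model locally. Assumption: the checkpoint can be loaded by
  # sentence-transformers; a small checkpoint such as "BAAI/bge-small-en-v1.5"
  # is a lighter alternative.
  model = SentenceTransformer("BAAI/bge-en-icl")

  def local_create_embeddings(texts):
      return model.encode(texts)
  ```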


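#### 7. Evaluating the embeddings

- **Intrinsic evaluation**  
  - Vector-space consistency: are similar texts close enough together?
  - Clustering: do texts of the same kind fall into the same cluster?

- **Extrinsic evaluation**  
  - Retrieval accuracy in the RAG system (Recall@K).
  - Relevancy and faithfulness of the generated answers.
  - Measured by human review or automated metrics (such as BLEU or ROUGE).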


In [None]:
def create_embeddings(texts, model="text-embedding-ada-002"):
    """Generates embedding vectors for a list of texts."""
    response = client.embeddings.create(model=model, input=texts)
    return [np.array(embedding.embedding) for embedding in response.data]

# Generate the embeddings with the replacement model
chunk_embeddings_dict = {
    size: create_embeddings(chunks)
    for size, chunks in tqdm(text_chunks_dict.items(), desc="Generating Embeddings")
}


Generating Embeddings:   0%|          | 0/3 [00:00<?, ?it/s]
Generating Embeddings:  33%|███▎      | 1/3 [00:15<00:31, 15.94s/it]
Generating Embeddings:  67%|██████▋   | 2/3 [00:25<00:11, 11.92s/it]
Generating Embeddings: 100%|██████████| 3/3 [00:30<00:00, 10.14s/it]


## Performing Semantic Search
We use cosine similarity to find the most relevant text chunks for a user query.

In [None]:
def cosine_similarity(vec1, vec2):
    """
    Computes cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): First vector.
    vec2 (np.ndarray): Second vector.

    Returns:
    float: Cosine similarity score.
    """

    # Dot product of the two vectors divided by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
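
As a sanity check on the formula, the following sketch recomputes cosine similarity in plain Python on small hand-made vectors. `cosine_similarity_py` is an illustrative name, not part of the notebook, and the vectors are toy values rather than real embeddings.

```python
import math

def cosine_similarity_py(vec1, vec2):
    """Plain-Python cosine similarity: dot(v1, v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

print(cosine_similarity_py([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity_py([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity_py([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The score depends only on direction, not on vector length, which is why the first pair scores the maximum even though the magnitudes differ. Both versions divide by zero if either vector is all zeros.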

In [None]:
def retrieve_relevant_chunks(query, text_chunks, chunk_embeddings, k=5):
    """
    Retrieves the top-k most relevant text chunks.

    Args:
    query (str): User query.
    text_chunks (List[str]): List of text chunks.
    chunk_embeddings (List[np.ndarray]): Embeddings of text chunks.
    k (int): Number of top chunks to return.

    Returns:
    List[str]: Most relevant text chunks.
    """
    # Generate an embedding for the query - pass query as a list and get first item
    query_embedding = create_embeddings([query])[0]

    # Calculate cosine similarity between the query embedding and each chunk embedding
    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]

    # Get the indices of the top-k most similar chunks
    top_indices = np.argsort(similarities)[-k:][::-1]

    # Return the top-k most relevant text chunks
    return [text_chunks[i] for i in top_indices]
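
The expression `np.argsort(similarities)[-k:][::-1]` takes the last `k` positions of an ascending sort and reverses them, which yields the indices of the `k` highest scores in descending order. A plain-Python sketch of the same selection, using made-up scores and a hypothetical helper `top_k_indices`:

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

scores = [0.12, 0.87, 0.45, 0.91, 0.30]  # made-up similarity scores
chunks = ["c0", "c1", "c2", "c3", "c4"]  # placeholder chunk texts

top = top_k_indices(scores, k=3)
print(top)                       # [3, 1, 2]
print([chunks[i] for i in top])  # ['c3', 'c1', 'c2']
```

When scores tie, this helper and the `argsort` version may order the tied indices differently.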

In [None]:
# Load the validation data from a JSON file
with open('val.json') as f:
    data = json.load(f)

# Extract the query at index 3 (the fourth validation item)
query = data[3]['question']

# Retrieve relevant chunks for each chunk size
retrieved_chunks_dict = {size: retrieve_relevant_chunks(query, text_chunks_dict[size], chunk_embeddings_dict[size]) for size in chunk_sizes}

# Print retrieved chunks for chunk size 256
print(retrieved_chunks_dict[256])

['AI enables personalized medicine by analyzing individual patient data, predicting treatment \nresponses, and tailoring interventions. Personalized medicine enhances treatment effectiveness \nand reduces adverse effects. \nRobotic Surgery \nAI-powered robotic s', 'g \npatient outcomes, and assisting in treatment planning. AI-powered tools enhance accuracy, \nefficiency, and patient care. \nDrug Discovery and Development \nAI accelerates drug discovery and development by analyzing biological data, predicting drug \neffica', ' analyzing biological data, predicting drug \nefficacy, and identifying potential drug candidates. AI-powered systems reduce the time and cost \nof bringing new treatments to market. \nPersonalized Medicine \nAI enables personalized medicine by analyzing indiv', 'mains. \nThese applications include: \nHealthcare \nAI is transforming healthcare through applications such as medical diagnosis, drug discovery, \npersonalized medicine, and robotic surgery. AI-powered to

## Generating a Response Based on Retrieved Chunks
Let's generate a response from the retrieved text for each chunk size and inspect the one for chunk size `256`.

In [None]:
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(query, system_prompt, retrieved_chunks, model="meta-llama/Llama-3.2-3B-Instruct"):
    """
    Generates an AI response based on retrieved chunks.

    Args:
    query (str): User query.
    system_prompt (str): System prompt that restricts the assistant to the given context.
    retrieved_chunks (List[str]): List of retrieved text chunks.
    model (str): AI model.

    Returns:
    str: AI-generated response.
    """
    # Combine retrieved chunks into a single context string
    context = "\n".join([f"Context {i+1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks)])

    # Create the user prompt by combining the context and the query
    user_prompt = f"{context}\n\nQuestion: {query}"

    # Generate the AI response using the specified model
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    # Return the content of the AI response
    return response.choices[0].message.content

# Generate AI responses for each chunk size
ai_responses_dict = {size: generate_response(query, system_prompt, retrieved_chunks_dict[size]) for size in chunk_sizes}

# Print the response for chunk size 256
print(ai_responses_dict[256])

InternalServerError: Error code: 503 - {'error': {'message': 'No available channel for model meta-llama/Llama-3.2-3B-Instruct under the current group default (request id: 20250612184613617348334pA2QvYfl)', 'type': 'new_api_error'}}

In [None]:
def generate_response(query, system_prompt, retrieved_chunks, model="gpt-3.5-turbo"):
    """Generates a response using a model that is available on the endpoint."""
    context = "\n".join([f"Context {i+1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks)])
    user_prompt = f"{context}\n\nQuestion: {query}"

    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    return response.choices[0].message.content

# Generate a response for each chunk size
ai_responses_dict = {
    size: generate_response(query, system_prompt, retrieved_chunks_dict[size])
    for size in chunk_sizes
}
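
To make the prompt layout concrete, this sketch rebuilds the user prompt the same way `generate_response` does, without calling any model. `build_user_prompt` is a hypothetical helper, and the chunks and question are invented placeholders.

```python
def build_user_prompt(query, retrieved_chunks):
    """Numbers each chunk as 'Context i:' and appends the question at the end."""
    context = "\n".join(
        f"Context {i+1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return f"{context}\n\nQuestion: {query}"

prompt = build_user_prompt("What is AI?", ["first chunk", "second chunk"])
print(prompt)
```

Since the default `k=5` chunks are all concatenated, the context length grows linearly with chunk size: about 640 characters at size 128 versus about 2,560 at size 512.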

In [None]:
# List all available models
models = client.models.list()
available_models = [m.id for m in models.data]
print("Available models:", available_models)

# Pick a model that exists in the list
model = next((m for m in available_models if "gpt" in m), "gpt-3.5-turbo")

Available models: ['claude-3-5-sonnet-20240620', 'claude-3-5-sonnet-20241022', 'claude-3-7-sonnet-20250219', 'claude-opus-4-20250514', 'claude-sonnet-4-20250514', 'deepseek-r1', 'DeepSeek-R1-Distill-Qwen-32B', 'DeepSeek-R1-Distill-Qwen-7B', 'deepseek-v3', 'Doubao-1.5-vision-pro-32k', 'Doubao-embedding', 'Doubao-lite-128k', 'Doubao-lite-32k', 'Doubao-lite-4k', 'Doubao-pro-128k', 'Doubao-pro-32k', 'Doubao-pro-4k', 'gemini-2.0-flash', 'gemini-2.0-flash-lite', 'gemini-2.5-flash-preview-04-17', 'gemini-2.5-pro-preview-03-25', 'gemini-2.5-pro-preview-05-06', 'gpt-3.5-turbo', 'gpt-4', 'gpt-4-32k', 'gpt-4-turbo', 'gpt-4.1', 'gpt-4.1-mini', 'gpt-4.1-nano', 'gpt-4o', 'gpt-4o-mini', 'kimi-latest', 'kimi-thinking-preview', 'moonshot-small', 'moonshot-v1-128k', 'moonshot-v1-128k-vision-preview', 'moonshot-v1-32k', 'moonshot-v1-32k-vision-preview', 'moonshot-v1-8k', 'moonshot-v1-8k-vision-preview', 'moonshot-v1-auto', 'o1', 'o1-mini', 'o3', 'o3-mini', 'Pro-DeepSeek-R1', 'Pro-DeepSeek-V3', 'qwen-max', 'qwen-plus

## Evaluating the AI Response
We score responses on faithfulness and relevancy, using a powerful LLM as the judge.

In [None]:
# Define evaluation scoring system constants
SCORE_FULL = 1.0     # Complete match or fully satisfactory
SCORE_PARTIAL = 0.5  # Partial match or somewhat satisfactory
SCORE_NONE = 0.0     # No match or unsatisfactory
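
These constants are interpolated into the evaluation prompts with `str.format`, which renders floats through `str()`, so the judge model sees `1.0`, `0.5`, and `0.0` literally and is expected to reply with one of those strings. A small sketch with a shortened, hypothetical template (`DEMO_TEMPLATE`) shows the rendering:

```python
SCORE_FULL, SCORE_PARTIAL, SCORE_NONE = 1.0, 0.5, 0.0

# Shortened stand-in for the real prompt templates (hypothetical)
DEMO_TEMPLATE = (
    "User Query: {question}\n"
    "AI Response: {response}\n"
    "Return ONLY the numerical score ({full}, {partial}, or {none})."
)

rendered = DEMO_TEMPLATE.format(
    question="What is AI?",
    response="AI is ...",
    full=SCORE_FULL,
    partial=SCORE_PARTIAL,
    none=SCORE_NONE,
)
print(rendered)
```

Curly braces inside the substituted values (for example, a response that contains JSON) are safe because `str.format` parses only the template. Literal braces in the template itself would need to be doubled.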

In [None]:
# Define strict evaluation prompt templates
FAITHFULNESS_PROMPT_TEMPLATE = """
Evaluate the faithfulness of the AI response compared to the true answer.
User Query: {question}
AI Response: {response}
True Answer: {true_answer}

Faithfulness measures how well the AI response aligns with facts in the true answer, without hallucinations.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely faithful, no contradictions with true answer
    * {partial} = Partially faithful, minor contradictions
    * {none} = Not faithful, major contradictions or hallucinations
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""

In [None]:
RELEVANCY_PROMPT_TEMPLATE = """
Evaluate the relevancy of the AI response to the user query.
User Query: {question}
AI Response: {response}

Relevancy measures how well the response addresses the user's question.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely relevant, directly addresses the query
    * {partial} = Partially relevant, addresses some aspects
    * {none} = Not relevant, fails to address the query
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""

In [None]:
def evaluate_response(question, response, true_answer):
        """
        Evaluates the quality of an AI-generated response based on faithfulness and relevancy.

        Args:
        question (str): The user's original question.
        response (str): The AI-generated response being evaluated.
        true_answer (str): The correct answer used as ground truth.

        Returns:
        Tuple[float, float]: A tuple containing (faithfulness_score, relevancy_score).
                                                Each score is one of: 1.0 (full), 0.5 (partial), or 0.0 (none).
        """
        # Format the evaluation prompts
        faithfulness_prompt = FAITHFULNESS_PROMPT_TEMPLATE.format(
                question=question,
                response=response,
                true_answer=true_answer,
                full=SCORE_FULL,
                partial=SCORE_PARTIAL,
                none=SCORE_NONE
        )

        relevancy_prompt = RELEVANCY_PROMPT_TEMPLATE.format(
                question=question,
                response=response,
                full=SCORE_FULL,
                partial=SCORE_PARTIAL,
                none=SCORE_NONE
        )

        # Request faithfulness evaluation from the model
        faithfulness_response = client.chat.completions.create(
                model="meta-llama/Llama-3.2-3B-Instruct",
                temperature=0,
                messages=[
                        {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
                        {"role": "user", "content": faithfulness_prompt}
                ]
        )

        # Request relevancy evaluation from the model
        relevancy_response = client.chat.completions.create(
                model="meta-llama/Llama-3.2-3B-Instruct",
                temperature=0,
                messages=[
                        {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
                        {"role": "user", "content": relevancy_prompt}
                ]
        )

        # Extract scores and handle potential parsing errors
        try:
                faithfulness_score = float(faithfulness_response.choices[0].message.content.strip())
        except ValueError:
                print("Warning: Could not parse faithfulness score, defaulting to 0")
                faithfulness_score = 0.0

        try:
                relevancy_score = float(relevancy_response.choices[0].message.content.strip())
        except ValueError:
                print("Warning: Could not parse relevancy score, defaulting to 0")
                relevancy_score = 0.0

        return faithfulness_score, relevancy_score

# True answer for the same validation item as the query (index 3)
true_answer = data[3]['ideal_answer']

# Evaluate response for chunk size 256 and 128
faithfulness, relevancy = evaluate_response(query, ai_responses_dict[256], true_answer)
faithfulness2, relevancy2 = evaluate_response(query, ai_responses_dict[128], true_answer)

# print the evaluation scores
print(f"Faithfulness Score (Chunk Size 256): {faithfulness}")
print(f"Relevancy Score (Chunk Size 256): {relevancy}")

print("\n")

print(f"Faithfulness Score (Chunk Size 128): {faithfulness2}")
print(f"Relevancy Score (Chunk Size 128): {relevancy2}")

InternalServerError: Error code: 503 - {'error': {'message': 'No available channel for model meta-llama/Llama-3.2-3B-Instruct under the current group default (request id: 20250612184802643703132BsYy6YbT)', 'type': 'new_api_error'}}
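
Separately from the model-availability error, note that `float(...)` on the raw reply raises `ValueError` whenever the judge wraps the number in any text (for example `Score: 0.5`), and the score then silently defaults to 0. One possible alternative, not part of the original notebook, is to extract the first number with a regular expression and accept it only if it is one of the three allowed values. `parse_score` and `ALLOWED_SCORES` are hypothetical names.

```python
import re

ALLOWED_SCORES = (0.0, 0.5, 1.0)

def parse_score(reply, default=0.0):
    """Returns the first number in the reply if it is an allowed score, else default."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return default
    value = float(match.group())
    return value if value in ALLOWED_SCORES else default

print(parse_score("0.5"))         # 0.5
print(parse_score("Score: 1.0"))  # 1.0
print(parse_score("1"))           # 1.0
print(parse_score("no idea"))     # 0.0
print(parse_score("7/10"))        # 0.0 (7 is not an allowed score)
```

Falling back to the default still hides judge failures, so counting how often the fallback fires is a reasonable addition.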

In [None]:
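# 1. Switch the embedding model (originally BAAI/bge-en-icl)
def create_embeddings(texts, model="text-embedding-ada-002"):
    response = client.embeddings.create(model=model, input=texts)
    return [np.array(embedding.embedding) for embedding in response.data]

# 2. Switch the generation model (originally meta-llama/Llama-3.2-3B-Instruct)
def generate_response(query, system_prompt, retrieved_chunks, model="gpt-3.5-turbo"):
    # Body unchanged; only the default model differs
    context = "\n".join([f"Context {i+1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks)])
    user_prompt = f"{context}\n\nQuestion: {query}"
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    return response.choices[0].message.content

# 3. Switch the evaluation model (originally meta-llama/Llama-3.2-3B-Instruct)
def evaluate_response(question, response, true_answer, model="gpt-3.5-turbo"):
    # Build the faithfulness prompt as in the original function
    faithfulness_prompt = FAITHFULNESS_PROMPT_TEMPLATE.format(
        question=question, response=response, true_answer=true_answer,
        full=SCORE_FULL, partial=SCORE_PARTIAL, none=SCORE_NONE
    )
    # Request the faithfulness evaluation
    faithfulness_response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
            {"role": "user", "content": faithfulness_prompt}
        ]
    )
    # The relevancy request is identical apart from the prompt; the complete
    # function, including score parsing, is defined in the next cell.
    # ...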

In [None]:
def evaluate_response(question, response, true_answer):
        """
        Evaluates the quality of an AI-generated response based on faithfulness and relevancy.

        Args:
        question (str): The user's original question.
        response (str): The AI-generated response being evaluated.
        true_answer (str): The correct answer used as ground truth.

        Returns:
        Tuple[float, float]: A tuple containing (faithfulness_score, relevancy_score).
                                                Each score is one of: 1.0 (full), 0.5 (partial), or 0.0 (none).
        """
        # Format the evaluation prompts
        faithfulness_prompt = FAITHFULNESS_PROMPT_TEMPLATE.format(
                question=question,
                response=response,
                true_answer=true_answer,
                full=SCORE_FULL,
                partial=SCORE_PARTIAL,
                none=SCORE_NONE
        )

        relevancy_prompt = RELEVANCY_PROMPT_TEMPLATE.format(
                question=question,
                response=response,
                full=SCORE_FULL,
                partial=SCORE_PARTIAL,
                none=SCORE_NONE
        )

        # Request faithfulness evaluation from the model
        faithfulness_response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                temperature=0,
                messages=[
                        {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
                        {"role": "user", "content": faithfulness_prompt}
                ]
        )

        # Request relevancy evaluation from the model
        relevancy_response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                temperature=0,
                messages=[
                        {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
                        {"role": "user", "content": relevancy_prompt}
                ]
        )

        # Extract scores and handle potential parsing errors
        try:
                faithfulness_score = float(faithfulness_response.choices[0].message.content.strip())
        except ValueError:
                print("Warning: Could not parse faithfulness score, defaulting to 0")
                faithfulness_score = 0.0

        try:
                relevancy_score = float(relevancy_response.choices[0].message.content.strip())
        except ValueError:
                print("Warning: Could not parse relevancy score, defaulting to 0")
                relevancy_score = 0.0

        return faithfulness_score, relevancy_score

# True answer for the same validation item as the query (index 3)
true_answer = data[3]['ideal_answer']

# Evaluate response for chunk size 256 and 128
faithfulness, relevancy = evaluate_response(query, ai_responses_dict[256], true_answer)
faithfulness2, relevancy2 = evaluate_response(query, ai_responses_dict[128], true_answer)

# print the evaluation scores
print(f"Faithfulness Score (Chunk Size 256): {faithfulness}")
print(f"Relevancy Score (Chunk Size 256): {relevancy}")

print("\n")

print(f"Faithfulness Score (Chunk Size 128): {faithfulness2}")
print(f"Relevancy Score (Chunk Size 128): {relevancy2}")

Faithfulness Score (Chunk Size 256): 0.5
Relevancy Score (Chunk Size 256): 1.0


Faithfulness Score (Chunk Size 128): 0.5
Relevancy Score (Chunk Size 128): 1.0
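
Both chunk sizes receive identical scores on this single question, so one query cannot separate them. Comparing chunk sizes generally requires averaging scores over several validation questions. The sketch below shows only the aggregation step: `summarize_scores` is a hypothetical helper, and the values in `scores_by_size` are placeholders, not measured results.

```python
from statistics import mean

def summarize_scores(scores_by_size):
    """Averages (faithfulness, relevancy) pairs per chunk size."""
    summary = {}
    for size, pairs in scores_by_size.items():
        summary[size] = {
            "faithfulness": mean(f for f, _ in pairs),
            "relevancy": mean(r for _, r in pairs),
        }
    return summary

# Placeholder scores for two questions per chunk size (NOT real results)
scores_by_size = {
    128: [(0.5, 1.0), (1.0, 0.5)],
    256: [(0.5, 1.0), (1.0, 1.0)],
}

for size, avg in summarize_scores(scores_by_size).items():
    print(f"Chunk Size: {size}, "
          f"Faithfulness: {avg['faithfulness']:.2f}, "
          f"Relevancy: {avg['relevancy']:.2f}")
```

In the notebook, the pairs would come from calling `evaluate_response` for each question in `val.json` and each entry of `chunk_sizes`.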
