- RAGAS overview: https://gist.github.com/donbr/1a1281f647419aaacb8673223b69569c
- https://github.com/explodinggradients/ragas/blob/main/docs/getstarted/rag_testset_generation.md
- Non-English: https://docs.ragas.io/en/stable/howtos/customizations/testgenerator/_language_adaptation

In [2]:
pip install -qU --no-cache-dir langchain-openai==0.3.28 langchain-community==0.3.27 ragas==0.3.0 python-dotenv==1.1.1 unstructured[md]==0.18.5 pillow==10.4.0

Note: you may need to restart the kernel to use updated packages.


In [1]:
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True

In [1]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
import os

generator_llm = LangchainLLMWrapper(
    ChatOpenAI(
        model=os.getenv("GENERATOR_MODEL"),
        api_key=os.getenv("GENERATOR_API_KEY"),
        base_url=os.getenv("GENERATOR_BASE_URL")
    )
)

generator_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(
        model=os.getenv("EMBEDDER_MODEL"),
        api_key=os.getenv("EMBEDDER_API_KEY"),
        base_url=os.getenv("EMBEDDER_BASE_URL")
    )
)

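`os.getenv` returns `None` when a variable is unset, so a missing entry in `.env` only surfaces later as an authentication or connection error. A minimal sketch of an early check, assuming the six variable names used in the cell above; the helper name `missing_env_vars` is hypothetical and not part of any library:

```python
import os

# Variable names taken from the setup cell above.
REQUIRED_VARS = [
    "GENERATOR_MODEL", "GENERATOR_API_KEY", "GENERATOR_BASE_URL",
    "EMBEDDER_MODEL", "EMBEDDER_API_KEY", "EMBEDDER_BASE_URL",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Depending on the provider, the `*_BASE_URL` entries may be optional, in which case they can be dropped from `REQUIRED_VARS`.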


In [3]:
from langchain_community.document_loaders import DirectoryLoader

path = "docs/"
loader = DirectoryLoader(path, glob="**/*.md")
docs = loader.load()

In [4]:
%%script echo 'skipping this cell'

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain_core.documents import Document

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"), 
    ("###", "Header 3"),
    ("####", "Header 4"),
    ("#####", "Header 5"),
    ("######", "Header 6"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,  # Keep headers in content for context
)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    add_start_index=True,
)
print("✅ Text Chunkers initialized.")

print("\n🔄 Chunking documents...")
md_split_docs = []
for doc in docs:
    md_header_splits = markdown_splitter.split_text(doc.page_content)
    md_header_splits = text_splitter.split_documents(md_header_splits)
    # Convert back to Document objects, preserving original metadata
    for split_chunk in md_header_splits:
        headings_list = []
        # Extract header values in order based on headers_to_split_on
        for _, header_meta_key_name in headers_to_split_on:
            if header_meta_key_name in split_chunk.metadata:
                headings_list.append(split_chunk.metadata[header_meta_key_name])

        md_split_docs.append(Document(
            page_content=split_chunk.page_content,
            metadata={**doc.metadata, "headings": headings_list}
        ))
print(f"👍 Documents split into {len(md_split_docs)} chunks.")
docs = md_split_docs

skipping this cell
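
The inner loop of the skipped cell collects header values in heading-level order, reading each one from the chunk metadata under its key (`Header 1`, `Header 2`, ...). A self-contained sketch of that step on a plain dictionary; the helper name `ordered_headings` and the sample metadata are made up for illustration:

```python
# Same (marker, metadata key) pairs as in the cell above.
headers_to_split_on = [
    ("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3"),
    ("####", "Header 4"), ("#####", "Header 5"), ("######", "Header 6"),
]

def ordered_headings(chunk_metadata):
    """Collect header values from top level to deepest, skipping absent levels."""
    return [
        chunk_metadata[key]
        for _, key in headers_to_split_on
        if key in chunk_metadata
    ]

# Hypothetical metadata for one chunk (dict order deliberately shuffled).
sample = {"Header 2": "Setup", "Header 1": "Guide", "start_index": 0}
print(ordered_headings(sample))  # ['Guide', 'Setup']
```

Iterating over `headers_to_split_on` rather than over the metadata keeps the result ordered by heading depth regardless of how the metadata dictionary was built.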


In [5]:
len(docs)

1

In [6]:
docs[0]



## Adapt prompts to generate data in Chinese

### For single document
This is the default. Use the multiple-documents option below when there is more than one document.

In [None]:
from ragas.testset.synthesizers.single_hop.specific import (
    SingleHopSpecificQuerySynthesizer,
)


query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 1.0),
]

for query, _ in query_distribution:
    prompts = await query.adapt_prompts("chinese", llm=generator_llm)
    query.set_prompts(**prompts)

### For multiple documents

In [21]:
%%script echo 'skipping this cell'
from ragas.testset.synthesizers import default_query_distribution

query_distribution = default_query_distribution(generator_llm)

for query, _ in query_distribution:
    prompts = await query.adapt_prompts("chinese", llm=generator_llm)
    query.set_prompts(**prompts)  

print(query_distribution)

skipping this cell


In [8]:
from ragas.testset.persona import Persona

personas = [
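    # Student: new to the topic, needs basic explanations.
    Persona(name="学生", role_description="刚接触这个主题，需要基础解释"),
    # Expert: familiar with the content, focuses on in-depth questions.
    Persona(name="专家", role_description="熟悉内容，关注深入问题"),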
]

In [9]:
from ragas.testset.transforms.extractors.llm_based import NERExtractor
from ragas.testset.transforms.splitters import HeadlineSplitter

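# HeadlineSplitter reads a 'headlines' property from each node. No transform in
# this list sets that property, so the splitter cannot run on these nodes; an
# extractor that populates 'headlines' would need to come first.
transforms = [HeadlineSplitter(), NERExtractor(llm=generator_llm)]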

In [10]:
from ragas.testset.graph import KnowledgeGraph
from ragas.testset.graph import Node, NodeType

kg = KnowledgeGraph()

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 1, relationships: 0)

In [None]:
from ragas.testset.transforms import apply_transforms
apply_transforms(kg, transforms)

Applying HeadlineSplitter:   0%|          | 0/1 [00:00<?, ?it/s]unable to apply transformation: 'headlines' property not found in this node
                                                                    

KnowledgeGraph(nodes: 1, relationships: 0)

In [18]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings,
    persona_list=personas,
    knowledge_graph=kg,
)
dataset = generator.generate(
    testset_size=10,
    query_distribution=query_distribution,
)

Generating Scenarios: 100%|██████████| 1/1 [00:01<00:00,  1.49s/it]
Generating Samples: 100%|██████████| 10/10 [00:08<00:00,  1.15it/s]


In [19]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"How does China's ancient wisdom, specifically ...",[《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精华...,"According to the text, Yi Xue, originating in ...",single_hop_specifc_query_synthesizer
1,根据这段文字，荣格对《周易》的评价是什么？,[《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精华...,根据这段文字，荣格高度评价《周易》的智慧价值。,single_hop_specifc_query_synthesizer
2,中华文化中的易学，它对现代科学有什么启发？,[《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精华...,易学从中国走向世界，引发跨学科研究热潮，包括社会科学、自然科学的多元视角。中外学者高度评价《...,single_hop_specifc_query_synthesizer
3,汉代易学的主要特点是什么？,[《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精华...,"汉代易学的一个特点是象数易学与五行学说的极端化。董仲舒提出""天人感应""，将自然现象与人事吉凶...",single_hop_specifc_query_synthesizer
4,中华文化中的易学，其科学价值主要体现在哪个方面？,[《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精华...,易学作为中华文化的智慧总汇，其核心价值在于思维科学领域，兼具理论深度与实践意义，需以开放态度...,single_hop_specifc_query_synthesizer
5,作为学生，我想了解一下易学从中国走向世界后，对全球产生了哪些影响？,[《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精华...,近代以来，易学从中国走向世界，引发跨学科研究热潮，包括社会科学、自然科学的多元视角。,single_hop_specifc_query_synthesizer
6,请问《神奇之门》这本书里提到的诺贝尔奖和易学有什么关系？,[《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精华...,《神奇之门》这本书中提到，中外学者高度评价《周易》的智慧价值，认为其宇宙观与规律性研究对现代...,single_hop_specifc_query_synthesizer
7,根据你对易学现代研究的了解，张协和对《周易》的评价是什么？,[《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精华...,张协和高度评价《周易》的智慧价值，认为其宇宙观与规律性研究对现代科学有启发，甚至与诺贝尔奖成...,single_hop_specifc_query_synthesizer
8,请问根据《神奇之门: 奇门遁甲大解谜》这本书，肿华文化的核心价值是什么？,[《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精华...,根据《神奇之门: 奇门遁甲大解谜》这本书，中华文化的核心价值在于思维科学领域，兼具理论深度与...,single_hop_specifc_query_synthesizer
9,谁撰写了《易传》，并且它对孔子研究《周易》有何影响？,[《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精华...,孔子撰写了《易传》，提升了《周易》的哲学与社会价值。孔子晚年深入研究《周易》，撰写《易传》。,single_hop_specifc_query_synthesizer
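
Row 0 above is in English even though the prompts were adapted to Chinese. A rough way to flag such rows is to test whether the question contains any CJK ideographs. This is a sketch only: the helper name `contains_cjk` is hypothetical, and it checks just the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
def contains_cjk(text):
    """True if `text` has at least one character in U+4E00..U+9FFF."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

# Two questions copied from the table above.
questions = [
    "How does China's ancient wisdom, specifically ...",
    "汉代易学的主要特点是什么？",
]
for q in questions:
    print(contains_cjk(q), q)
```

On the full table, something like `df[~df["user_input"].map(contains_cjk)]` would list the rows to inspect or regenerate, where `df` is the result of `dataset.to_pandas()`.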


In [20]:
eval_dataset = dataset.to_evaluation_dataset()
print("Query:", eval_dataset[0].user_input)
print("Reference:", eval_dataset[0].reference)

Query: How does China's ancient wisdom, specifically Yi Xue, relate to modern scientific thought, and are there specific examples of this connection?
Reference: According to the text, Yi Xue, originating in China, is considered a treasure trove of Chinese culture with its core value residing in the field of thinking science. It offers unique cognitive models like dialectical and象数思维, which provide guidance for natural science, social science, and life practice. Some scholars believe 《周易》's wisdom and its research on the universe and its laws are insightful for modern science, such as quantum physics, and even related to Nobel Prize-winning achievements. The阴阳爻 are similar to the computer's binary code of 0 and 1, showcasing highly abstract symbolic thinking. The text also mentions that the易学符号系统, though originating from intuitive analogy, has structural laws that coincide with later scientific discoveries like genetics and mathematics, suggesting it contains model thinking that transce

In [None]:
dataset.to_pandas().to_csv('data.csv', encoding='utf-8-sig', index=False)

In [23]:
import pandas as pd

df = pd.read_csv('data.csv', encoding='utf-8-sig')
df.head()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"How does China's ancient wisdom, specifically ...",['《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精...,"According to the text, Yi Xue, originating in ...",single_hop_specifc_query_synthesizer
1,根据这段文字，荣格对《周易》的评价是什么？,['《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精...,根据这段文字，荣格高度评价《周易》的智慧价值。,single_hop_specifc_query_synthesizer
2,中华文化中的易学，它对现代科学有什么启发？,['《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精...,易学从中国走向世界，引发跨学科研究热潮，包括社会科学、自然科学的多元视角。中外学者高度评价《...,single_hop_specifc_query_synthesizer
3,汉代易学的主要特点是什么？,['《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精...,"汉代易学的一个特点是象数易学与五行学说的极端化。董仲舒提出""天人感应""，将自然现象与人事吉凶...",single_hop_specifc_query_synthesizer
4,中华文化中的易学，其科学价值主要体现在哪个方面？,['《神奇之门: 奇门遁甲大解谜》\n\n张志春著\n\n上 编\n\n易学思维中的科学性精...,易学作为中华文化的智慧总汇，其核心价值在于思维科学领域，兼具理论深度与实践意义，需以开放态度...,single_hop_specifc_query_synthesizer
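
After the round trip through CSV, `reference_contexts` comes back as a string (note the quote after `[` in the output above) rather than a list. A minimal sketch of restoring it with the standard library's `ast.literal_eval`, assuming each cell holds the `repr` of a Python list of strings; the helper name `parse_contexts` is hypothetical and the sample value is shortened for illustration:

```python
import ast

def parse_contexts(cell):
    """Turn a stringified list such as "['a', 'b']" back into a list of str."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value

# Shortened, hypothetical cell value in the shape read_csv returns.
cell = "['《神奇之门》\\n\\n张志春著', '上 编']"
contexts = parse_contexts(cell)
print(type(contexts).__name__, len(contexts))  # list 2
```

Applied to the frame loaded above, this could look like `df["reference_contexts"] = df["reference_contexts"].map(parse_contexts)`. `ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for data read from a file.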
