
# LangChain RAG Playground for the `data/` Directory

This notebook sets up a minimal retrieval-augmented generation (RAG) workflow powered by LangChain. It ingests the resources stored in `../data` (a PDF research paper and a Chinese-language text file), builds a vector index, and exposes a helper function for interactive question answering.

> ℹ️ Feel free to clone and extend these building blocks—swap in another LLM, persist the vector store, or wire the retriever into an application backend.



## 1. Environment Setup

Uncomment and run the cell below if the required packages are not yet installed in your environment.


In [None]:

# !pip install -q --upgrade langchain langchain-community langchain-openai docling chromadb modelscope pypdf



## 2. Imports and Configuration

We keep paths relative so the notebook works whether it is launched from the project root or the `notebooks/` directory. The loader for the Chinese text file tries several common encodings before failing.


In [None]:

from pathlib import Path
from typing import Iterable, Optional
import re

from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import ModelScopeEmbeddings
from docling.document_converter import DocumentConverter

# If you plan to use OpenAI, make sure langchain-openai is installed and set OPENAI_API_KEY.
try:
    from langchain_openai import ChatOpenAI
except ImportError:
    ChatOpenAI = None

PROJECT_ROOT_CANDIDATES = [Path.cwd(), Path.cwd().parent]
for candidate in PROJECT_ROOT_CANDIDATES:
    data_dir = candidate / "data"
    if data_dir.exists():
        PROJECT_ROOT = candidate
        DATA_DIR = data_dir
        break
else:
    raise FileNotFoundError("Could not locate the 'data' directory. Update the path resolution logic above.")

print(f"Using project root: {PROJECT_ROOT}")
print(f"Available data files: {[p.name for p in DATA_DIR.iterdir()]}")
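
The `for`/`else` lookup above can also be wrapped in a small function, which makes the search order easy to test. This is a minimal sketch; `find_data_dir` is a hypothetical helper name, not something the notebook defines.

```python
from pathlib import Path
from typing import Iterable, Tuple


def find_data_dir(candidates: Iterable[Path], name: str = "data") -> Tuple[Path, Path]:
    """Return (project_root, data_dir) for the first candidate that contains `name`."""
    for candidate in candidates:
        data_dir = candidate / name
        if data_dir.is_dir():
            return candidate, data_dir
    # Mirrors the `else` branch of the loop above: no candidate matched.
    raise FileNotFoundError(f"Could not locate the {name!r} directory.")
```

Calling `find_data_dir([Path.cwd(), Path.cwd().parent])` would reproduce the behavior of the cell above.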



## 3. Load Raw Documents

Docling handles both PDF and rich-text parsing so we can derive structured Markdown representations. We still keep a manual encoding fallback for plain text files as a last resort.
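
The encoding fallback can be illustrated in isolation on raw bytes. The sketch below is an assumption-laden toy rather than the notebook's loader: `decode_with_fallback` is a hypothetical name, and the candidate order (with `utf-8-sig` first so that a byte-order mark is stripped) is one reasonable choice among several.

```python
from typing import Iterable, Tuple

# Assumed candidate order; "iso-8859-1" accepts any byte sequence, so it acts as a catch-all.
ENCODINGS = ("utf-8-sig", "utf-8", "gb18030", "iso-8859-1")


def decode_with_fallback(raw: bytes, encodings: Iterable[str] = ENCODINGS) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes `raw` without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    raise ValueError("No candidate encoding could decode the input.")


# Bytes written by a GB18030 editor are not valid UTF-8, so the third candidate wins.
sample = "检索增强生成".encode("gb18030")
text, used = decode_with_fallback(sample)
print(used, text)
```

Because `iso-8859-1` never raises, it should stay last: any candidate placed after it would never be tried, and a file that reaches it may decode into mojibake rather than fail loudly.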


In [None]:

# "utf-8-sig" goes first: it also decodes BOM-less UTF-8 and strips a leading BOM if present.
ENCODING_CANDIDATES: Iterable[str] = ("utf-8-sig", "utf-8", "gb18030", "iso-8859-1")

DOC_CONVERTER = DocumentConverter()
SECTION_HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)")


def load_text_with_fallback(path: Path, encodings: Iterable[str] = ENCODING_CANDIDATES) -> Document:
    last_error: Optional[Exception] = None
    for enc in encodings:
        try:
            text = path.read_text(encoding=enc)
            metadata = {"source": str(path.relative_to(PROJECT_ROOT)), "encoding": enc}
            return Document(page_content=text, metadata=metadata)
        except (UnicodeDecodeError, LookupError) as err:
            last_error = err
    if last_error is not None:
        raise last_error
    raise UnicodeDecodeError("Unable to decode", b"", 0, 0, "No encoding candidates were successful")


def iter_markdown_sections(markdown: str):
    current_heading: Optional[str] = None
    current_buffer: list[str] = []
    for line in markdown.splitlines():
        match = SECTION_HEADING_RE.match(line)
        if match:
            if current_buffer:
                yield current_heading, "\n".join(current_buffer).strip()
                current_buffer = []
            current_heading = match.group(2).strip()
        else:
            current_buffer.append(line)
    if current_buffer:
        yield current_heading, "\n".join(current_buffer).strip()


def chunk_section_text(text: str, max_chars: int = 1200):
    paragraphs = [para.strip() for para in re.split(r"\n\s*\n", text) if para.strip()]
    if not paragraphs:
        cleaned = text.strip()
        if cleaned:
            yield cleaned
        return
    buffer: list[str] = []
    length = 0
    for para in paragraphs:
        para_len = len(para)
        if buffer and length + para_len > max_chars:
            yield "\n\n".join(buffer)
            buffer = [para]
            length = para_len
        else:
            buffer.append(para)
            length += para_len
    if buffer:
        yield "\n\n".join(buffer)


def build_semantic_documents_from_markdown(markdown: str, path: Path, base_metadata: Optional[dict] = None) -> list[Document]:
    base_metadata = base_metadata.copy() if base_metadata else {}
    documents: list[Document] = []
    section_counter = 0
    for section_counter, (heading, body) in enumerate(iter_markdown_sections(markdown), start=1):
        if not body:
            continue
        title = heading or path.stem
        for chunk_index, chunk in enumerate(chunk_section_text(body), start=1):
            content = f"{title}\n\n{chunk}" if heading else chunk
            metadata = {
                "source": str(path.relative_to(PROJECT_ROOT)),
                "parser": "docling",
                "section_title": title,
                "section_order": section_counter,
                "chunk_index": chunk_index,
            }
            metadata.update(base_metadata)
            documents.append(Document(page_content=content.strip(), metadata=metadata))
    if not documents:
        cleaned = markdown.strip()
        if cleaned:
            metadata = {
                "source": str(path.relative_to(PROJECT_ROOT)),
                "parser": "docling",
                "section_title": path.stem,
                "section_order": 1,
                "chunk_index": 1,
            }
            metadata.update(base_metadata)
            documents.append(Document(page_content=cleaned, metadata=metadata))
    return documents


def convert_with_docling(path: Path) -> list[Document]:
    base_metadata = {"original_name": path.name}
    try:
        result = DOC_CONVERTER.convert(str(path))
        markdown = result.document.export_to_markdown()
        return build_semantic_documents_from_markdown(markdown, path, base_metadata)
    except UnicodeDecodeError:
        fallback_doc = load_text_with_fallback(path)
        base_metadata.update(fallback_doc.metadata)
        return build_semantic_documents_from_markdown(fallback_doc.page_content, path, base_metadata)
    except Exception as err:
        print(f"⚠️ Docling failed for {path.name}: {err}. Falling back to plain-text decoding.")
        fallback_doc = load_text_with_fallback(path)
        base_metadata.update(fallback_doc.metadata)
        return build_semantic_documents_from_markdown(fallback_doc.page_content, path, base_metadata)
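
To see what the heading-based split produces, the standalone sketch below applies the same idea to a short Markdown string. `split_by_headings` and the sample text are illustrative stand-ins, not part of the notebook.

```python
import re
from typing import Iterator, Optional, Tuple

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)")


def split_by_headings(markdown: str) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (heading, body) pairs; text before the first heading gets heading=None."""
    heading: Optional[str] = None
    buffer: list[str] = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the previous section, if it collected any lines.
            if buffer:
                yield heading, "\n".join(buffer).strip()
                buffer = []
            heading = match.group(2).strip()
        else:
            buffer.append(line)
    if buffer:
        yield heading, "\n".join(buffer).strip()


sample = "Preamble text.\n\n# Intro\nFirst paragraph.\n\n## Method\nSecond paragraph."
for heading, body in split_by_headings(sample):
    print(repr(heading), "->", repr(body))
```

Two properties are worth noting: a heading followed immediately by another heading yields nothing for the first one, and the heading level (`#` versus `##`) is discarded, so nested sections are flattened.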


In [None]:

# Parse documents with Docling and build semantic chunks.
semantic_docs: list[Document] = []
for data_file in sorted(DATA_DIR.glob("*")):
    if data_file.suffix.lower() not in {".pdf", ".txt", ".md"}:
        print(f"Skipping unsupported file: {data_file.name}")
        continue
    docs_from_file = convert_with_docling(data_file)
    semantic_docs.extend(docs_from_file)
    print(f"Parsed {data_file.name} into {len(docs_from_file)} semantic chunks")

print(f"Total semantic chunks: {len(semantic_docs)}")
if semantic_docs:
    preview = {
        doc.metadata.get("section_title", doc.metadata.get("source", "unknown")): len(doc.page_content)
        for doc in semantic_docs[:5]
    }
    print(f"Preview chunk lengths: {preview}")



## 4. Chunk Documents and Build the Vector Store

Instead of fixed-length windows, we reuse Docling's structured Markdown output to group content by headings and logical paragraphs. The resulting chunks better follow semantic boundaries before they are embedded.
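
The paragraph grouping can be sketched on its own as a greedy packer. The function name `pack_paragraphs` and the tiny `max_chars=8` budget are assumptions chosen to make the behavior visible on a toy input.

```python
import re
from typing import Iterator


def pack_paragraphs(text: str, max_chars: int = 1200) -> Iterator[str]:
    """Greedily merge blank-line-separated paragraphs into chunks of roughly max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    buffer: list[str] = []
    length = 0
    for para in paragraphs:
        # Flush before adding a paragraph that would push the chunk past the budget.
        if buffer and length + len(para) > max_chars:
            yield "\n\n".join(buffer)
            buffer, length = [], 0
        buffer.append(para)
        length += len(para)
    if buffer:
        yield "\n\n".join(buffer)


text = "aaaa\n\nbbbb\n\ncccccccccc\n\ndd"
chunks = list(pack_paragraphs(text, max_chars=8))
print(chunks)  # ['aaaa\n\nbbbb', 'cccccccccc', 'dd']
```

Paragraphs are never split, so a single paragraph longer than `max_chars` becomes an oversized chunk of its own, and the budget counts paragraph characters only, not the blank-line separators.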


In [None]:

# No additional chunking required; section 3 already split Docling's Markdown into heading- and paragraph-aligned segments.
chunked_docs = semantic_docs
print(f"Using {len(chunked_docs)} Docling-generated chunks for the vector store")


In [None]:

# ModelScope hosts multilingual embedding models; swap to a lighter model ID if resources are limited.
embedding_model_id = "damo/nlp_corom_sentence-embedding_zh-cn-base"
embeddings = ModelScopeEmbeddings(model_id=embedding_model_id)

vector_store = Chroma.from_documents(documents=chunked_docs, embedding=embeddings, collection_name="rag-data")
retriever = vector_store.as_retriever()
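
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. The toy below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. It is not how Chroma is implemented (Chroma uses its own index and a configurable distance metric), and `cosine`, `top_k`, and `toy_index` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the ids of the k vectors most similar to the query, best first."""
    ranked = sorted(doc_vecs.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


toy_index = {
    "intro": [1.0, 0.0, 0.0],
    "method": [0.7, 0.7, 0.0],
    "results": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], toy_index, k=2))  # ['intro', 'method']
```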



## 5. Choose an LLM Backend

By default the notebook expects an OpenAI-compatible chat model. Adapt the cell below to match your setup:

- **OpenAI:** requires `langchain-openai` and `OPENAI_API_KEY`.
- **Ollama / other local models:** replace the cell with the appropriate LangChain chat wrapper (e.g. `from langchain_community.chat_models import ChatOllama`).


In [None]:

if ChatOpenAI is None:
    raise ImportError("Install langchain-openai to use ChatOpenAI, or swap in another chat model.")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
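
For a local model, one possible replacement for the cell above is sketched here. It assumes an Ollama server is running locally and that the named model has already been pulled. The helper name `make_ollama_llm` and the model name `"llama3"` are placeholders, not values taken from this notebook.

```python
def make_ollama_llm(model: str = "llama3", temperature: float = 0.2):
    """Build a local chat model; assumes a running Ollama server with `model` available."""
    # Imported lazily so the notebook still loads when the Ollama integration is absent.
    from langchain_community.chat_models import ChatOllama

    return ChatOllama(model=model, temperature=temperature)


# llm = make_ollama_llm()  # uncomment to use instead of the ChatOpenAI instance
```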



## 6. Build the Retrieval-Augmented Chain

`RetrievalQA` wraps the retriever behind a simple `.invoke({"query": ...})` interface and returns both the answer and the supporting source documents.


In [None]:

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)
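
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt alongside the question. The sketch below shows the idea only: `build_stuff_prompt` is a hypothetical helper, and the instruction wording is illustrative rather than LangChain's actual template.

```python
from typing import Sequence


def build_stuff_prompt(question: str, contexts: Sequence[str]) -> str:
    """Concatenate retrieved chunks and the question into one prompt string."""
    context_block = "\n\n".join(contexts)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    "What is RAG?",
    ["RAG combines retrieval with generation.", "A retriever selects relevant chunks."],
)
print(prompt)
```

Since every retrieved chunk lands in one prompt, the number of chunks returned by the retriever and the chunk size together determine how much of the model's context window is consumed.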



## 7. Ask Questions

Use the helper below to query the knowledge base interactively. Sources are printed by default; pass `show_sources=False` to hide the retrieved context.


In [None]:

def ask(question: str, show_sources: bool = True) -> None:
    response = qa_chain.invoke({"query": question})
    answer = response["result"]
    print(f"\nQ: {question}\nA: {answer}\n")
    if show_sources:
        print("Sources:")
        for idx, doc in enumerate(response.get("source_documents", []), start=1):
            source = doc.metadata.get("source", "unknown")
            snippet = doc.page_content[:200].replace("\n", " ") + ("…" if len(doc.page_content) > 200 else "")
            print(f"[{idx}] {source}\n    {snippet}\n")

# Example question (customize to your needs)
ask("这份资料中提到的主要主题是什么？")



---

### Next Ideas
- Persist the `Chroma` index to disk (`persist_directory=...`) for faster reloads.
- Swap in a chat-oriented chain (e.g. `ConversationalRetrievalChain`) to maintain dialogue history.
- Attach tracing via `langsmith` or `langchain.debug` to inspect retrieval quality.
