
# LangChain RAG Playground for the `data/` Directory

This notebook sets up a minimal retrieval-augmented generation (RAG) workflow powered by LangChain. It ingests the resources stored in `../data` (a PDF research paper and a Chinese-language text file), builds a vector index, and exposes a helper function for interactive question answering.

> ℹ️ Feel free to clone and extend these building blocks—swap in another LLM, persist the vector store, or wire the retriever into an application backend.



## 1. Environment Setup

Uncomment and run the cell below if the required packages are not yet installed in your environment.


In [None]:

# !pip install -q --upgrade langchain langchain-community langchain-openai langchain-text-splitters chromadb sentence-transformers pypdf
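
Before running the install command, it can help to see which packages are actually absent. The following is a minimal sketch using only the standard library; `REQUIRED_MODULES` and `missing_modules` are hypothetical names introduced here, and the list assumes each package's import name follows the usual convention (hyphens become underscores).

```python
from importlib.util import find_spec
from typing import Iterable, List

# Assumed import names for the packages listed in the pip command.
REQUIRED_MODULES = (
    "langchain",
    "langchain_community",
    "langchain_openai",
    "langchain_text_splitters",
    "chromadb",
    "sentence_transformers",
    "pypdf",
)

def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

If the printed list is empty, the install cell can stay commented out.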



## 2. Imports and Configuration

We keep paths relative so the notebook works whether it is launched from the project root or the `notebooks/` directory. The loader for the Chinese text file tries several common encodings before failing.


In [None]:

from pathlib import Path
from typing import Iterable, Optional

from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# If you plan to use OpenAI, make sure langchain-openai is installed and set OPENAI_API_KEY.
try:
    from langchain_openai import ChatOpenAI
except ImportError:
    ChatOpenAI = None

PROJECT_ROOT_CANDIDATES = [Path.cwd(), Path.cwd().parent]
for candidate in PROJECT_ROOT_CANDIDATES:
    data_dir = candidate / "data"
    if data_dir.exists():
        PROJECT_ROOT = candidate
        DATA_DIR = data_dir
        break
else:
    raise FileNotFoundError("Could not locate the 'data' directory. Update the path resolution logic above.")

print(f"Using project root: {PROJECT_ROOT}")
print(f"Available data files: {[p.name for p in DATA_DIR.iterdir()]}")
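
The two-candidate lookup above covers the project root and one level below it. If notebooks may live deeper in the tree, the same idea can be generalized by walking up every parent directory. This is a sketch under that assumption; `find_data_dir` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path
from typing import Optional

def find_data_dir(start: Path, name: str = "data") -> Optional[Path]:
    """Return the first directory called `name` in `start` or any of its parents."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        data_dir = candidate / name
        if data_dir.is_dir():
            return data_dir
    return None  # The caller decides whether a missing directory is fatal.

print(find_data_dir(Path.cwd()))
```

Returning `None` instead of raising keeps the helper reusable; the notebook's `FileNotFoundError` can still be raised by the caller.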



## 3. Load Raw Documents

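`load_text_with_fallback` tries several encodings in order because `data.txt` is not guaranteed to be UTF-8 encoded. Note that `iso-8859-1` accepts any byte sequence, so it acts as a last resort and can produce garbled text rather than an error. Adjust or reorder the encoding list if the decoded content looks wrong.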


In [None]:

ENCODING_CANDIDATES: Iterable[str] = ("utf-8", "utf-8-sig", "gb18030", "iso-8859-1")

def load_text_with_fallback(path: Path, encodings: Iterable[str] = ENCODING_CANDIDATES) -> Document:
    last_error: Optional[Exception] = None
    for enc in encodings:
        try:
            text = path.read_text(encoding=enc)
            metadata = {"source": str(path.relative_to(PROJECT_ROOT)), "encoding": enc}
            return Document(page_content=text, metadata=metadata)
        except (UnicodeDecodeError, LookupError) as err:
            last_error = err
    # UnicodeDecodeError expects a bytes object as its second argument, so raise a plain ValueError instead.
    raise ValueError(f"Unable to decode {path} with the given encodings") from last_error

# Load the PDF with PyPDFLoader so each page becomes a Document.
pdf_path = DATA_DIR / "2101.03697v3.pdf"
pdf_docs = []
if pdf_path.exists():
    pdf_loader = PyPDFLoader(str(pdf_path))
    pdf_docs = pdf_loader.load()
else:
    print("⚠️ Skipping PDF because it was not found.")

# Load the text file with encoding fallback.
text_path = DATA_DIR / "data.txt"
text_docs = []
if text_path.exists():
    text_docs = [load_text_with_fallback(text_path)]
else:
    print("⚠️ Skipping text file because it was not found.")

raw_docs = pdf_docs + text_docs
print(f"Loaded {len(raw_docs)} documents")
print({doc.metadata.get('source', 'unknown'): len(doc.page_content) for doc in raw_docs})
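
The order of the encoding candidates matters. The sketch below applies the same fallback idea to raw bytes so the effect is easy to see without a file on disk; `decode_with_fallback` and the sample string are hypothetical and only serve the illustration.

```python
from typing import Iterable, Tuple

def decode_with_fallback(raw: bytes, encodings: Iterable[str]) -> Tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes `raw` without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue  # Try the next candidate.
    raise ValueError("None of the candidate encodings could decode the input")

original = "检索增强生成"
sample = original.encode("gb18030")  # Simulates a legacy-encoded Chinese file.

# With gb18030 in the list, the text round-trips correctly.
text, used = decode_with_fallback(sample, ("utf-8", "utf-8-sig", "gb18030", "iso-8859-1"))
print(used, text == original)

# Without it, iso-8859-1 "succeeds" but yields mojibake instead of raising.
garbled, used_fallback = decode_with_fallback(sample, ("utf-8", "iso-8859-1"))
print(used_fallback, garbled == original)
```

Because `iso-8859-1` never fails, any encoding placed after it in the list would never be tried.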



## 4. Chunk Documents and Build the Vector Store

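The splitter caps each chunk at 800 characters and repeats up to 120 characters between neighbouring chunks. The overlap makes it more likely that a sentence straddling a chunk boundary appears whole in at least one chunk, which gives the retriever more coherent context.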


In [None]:

text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
chunked_docs = text_splitter.split_documents(raw_docs)
print(f"Chunked into {len(chunked_docs)} documents")
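
To make the role of `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences; `sliding_chunks` is a hypothetical helper that only illustrates how overlap repeats text across chunk boundaries.

```python
from typing import List

def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final chunk contains more than just the overlap.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so content near a boundary is present in both neighbours.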


In [None]:

# SentenceTransformers models work well for multilingual text; change to a smaller model if resources are limited.
embedding_model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

vector_store = Chroma.from_documents(documents=chunked_docs, embedding=embeddings, collection_name="rag-data")
retriever = vector_store.as_retriever()
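
Conceptually, the retriever embeds the query and returns the chunks whose vectors lie closest to it. The toy sketch below shows that ranking step with hand-written three-dimensional vectors and cosine similarity. The vectors, the `cosine` and `top_k` helpers, and the choice of cosine similarity are assumptions for illustration; the metric Chroma actually applies depends on how the collection is configured.

```python
import math
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the ids of the `k` vectors most similar to `query_vec`."""
    ranked = sorted(index, key=lambda key: cosine(query_vec, index[key]), reverse=True)
    return ranked[:k]

# Toy "embeddings"; real ones come from the embedding model and have hundreds of dimensions.
toy_index = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.6, 0.8, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
best = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(best)
```

In the notebook itself, the number of returned chunks can be set when creating the retriever, for example `vector_store.as_retriever(search_kwargs={"k": 2})`.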



## 5. Choose an LLM Backend

By default the notebook expects an OpenAI-compatible chat model. The cell below instantiates `ChatOpenAI`; replace it if your setup differs:

- **OpenAI:** requires `langchain-openai` and `OPENAI_API_KEY`.
- **Ollama / other local models:** replace the cell with the appropriate LangChain chat wrapper (e.g. `from langchain_community.chat_models import ChatOllama`).


In [None]:

if ChatOpenAI is None:
    raise ImportError("Install langchain-openai to use ChatOpenAI, or swap in another chat model.")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
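
One way to switch backends without editing the cell each time is to choose based on the environment. This is a sketch, not part of the notebook: `pick_backend` and `build_llm` are hypothetical helpers, the rule "use OpenAI when `OPENAI_API_KEY` is set" is an assumption, and the Ollama model name `llama3` is a placeholder for whatever model is available locally.

```python
import os
from typing import Mapping, Optional

def pick_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "openai" when an API key is configured, otherwise "ollama"."""
    env = os.environ if env is None else env
    return "openai" if env.get("OPENAI_API_KEY") else "ollama"

def build_llm(backend: str):
    """Instantiate a chat model; imports are deferred so unused backends need not be installed."""
    if backend == "openai":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
    if backend == "ollama":
        from langchain_community.chat_models import ChatOllama
        return ChatOllama(model="llama3", temperature=0.2)  # Placeholder model name.
    raise ValueError(f"Unknown backend: {backend!r}")

print("Selected backend:", pick_backend())
```

Calling `llm = build_llm(pick_backend())` would then replace the hard-coded `ChatOpenAI` line.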



## 6. Build the Retrieval-Augmented Chain

`RetrievalQA` wraps the retriever behind a simple `.invoke({"query": ...})` interface and returns both the answer and the supporting source documents.


In [None]:

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)
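
When no `chain_type` is passed, `RetrievalQA.from_chain_type` uses the "stuff" strategy: all retrieved chunks are concatenated into a single prompt together with the question. The sketch below mimics that step in plain Python. The prompt wording and the `build_stuffed_prompt` helper are illustrative assumptions, not LangChain's actual template.

```python
from typing import Sequence

def build_stuffed_prompt(question: str, contexts: Sequence[str]) -> str:
    """Concatenate retrieved chunks and the question into one prompt string."""
    context_block = "\n\n".join(contexts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuffed_prompt(
    "What is the document about?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Because every chunk lands in one prompt, the chunk size and the number of retrieved chunks together determine how much of the model's context window is consumed.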



## 7. Ask Questions

Use the helper below to query the knowledge base interactively. Retrieved sources are printed by default; pass `show_sources=False` to show only the answer.


In [None]:

def ask(question: str, show_sources: bool = True) -> None:
    response = qa_chain.invoke({"query": question})
    answer = response["result"]
    print(f"\nQ: {question}\nA: {answer}\n")
    if show_sources:
        print("Sources:")
        for idx, doc in enumerate(response.get("source_documents", []), start=1):
            source = doc.metadata.get("source", "unknown")
            # Collapse newlines so each snippet prints on a single line.
            snippet = doc.page_content[:200].replace("\n", " ") + ("…" if len(doc.page_content) > 200 else "")
            print(f"[{idx}] {source}\n    {snippet}\n")

# Example question (customize to your needs). The Chinese query below means:
# "What are the main topics mentioned in this material?"
ask("这份资料中提到的主要主题是什么？")



---

### Next Ideas
- Persist the `Chroma` index to disk (`persist_directory=...`) for faster reloads.
- Swap in a chat-oriented chain (e.g. `ConversationalRetrievalChain`) to maintain dialogue history.
- Attach tracing via LangSmith, or enable verbose logging with `set_debug(True)` from `langchain.globals`, to inspect retrieval quality.
