# Use LangChain to build a RAG application

LangChain is a framework designed to simplify the development of applications that use large language models (LLMs) and other AI tools. It provides a set of abstractions and utilities that make it easier to build complex applications by chaining together different components, such as LLMs, data sources, and processing steps.

## Table of Contents 

1. Load and Process Documents with LangChain
2. Chunking with LangChain
3. Vectorization with LangChain
4. Database with LangChain
5. Querying with LangChain
6. Retrieval with LangChain
7. Retrieval-Augmented Generation (RAG) with LangChain

## Load and Process Documents with LangChain

To load and process documents in LangChain, use a document loader. Loaders share a common interface (`BaseLoader`) for reading documents from sources such as local files, URLs, or databases. Each loader returns `Document` objects that hold the extracted text in `page_content` and the extracted metadata in `metadata`, and you can apply further transformations to the loaded documents afterwards.

https://python.langchain.com/docs/concepts/document_loaders/



In [None]:
# Install PyMuPDF (used by PyMuPDFLoader) and the LangChain integration packages
!pip install langchain-community langchain-openai pymupdf

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

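loader = PyMuPDFLoader(
    file_path="./resource/pdf/bmjgh-2021-005292.pdf",
    # headers=None,
    # password=None,
    mode="single",  # return the whole PDF as one Document instead of one per page
    pages_delimiter="\n\n",  # separator placed between pages in "single" mode
)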
documents = loader.load()

In [None]:
# get metadata, which is extracted when loading the document
documents[0].metadata  # Print metadata of the first document

In [None]:
print(f"Loaded {len(documents)} documents.")
print(documents[0].page_content[:500])  # Print the first 500 characters of the first document
print("==" * 20)
print("Metadata of the first document:")
for key, value in documents[0].metadata.items():
    print(f"{key}: {value}")


In [None]:
import fitz  # PyMuPDF

pdf_path = "./resource/pdf/bmjgh-2021-005292.pdf"
doc = fitz.open(pdf_path)

print("Embedded metadata:", doc.metadata, "\n")


# If the author field is empty, make a rough guess from the text of the first page
if not doc.metadata.get("author"):
    first_text = doc[0].get_text()  # extracts plain text by default
    lines = [ln.strip() for ln in first_text.splitlines() if ln.strip()]
    candidates = []
    for line in lines:
        if line.lower().startswith("abstract"):  # authors usually appear before the abstract
            break
        # Simple heuristic: contains a comma or " and " (several names) and is not too long
        if ("," in line or " and " in line.lower()) and 3 <= len(line) <= 180:
            candidates.append(line)
    print("Heuristic author lines:")
    for i, line in enumerate(candidates, 1):
        print(f"{i}. {line}")
else:
    print("Author metadata already exists:", doc.metadata["author"])

# Close the document
doc.close()

Once we have the metadata, we can use it to implement filtering or other logic in our RAG application.

In [None]:
# Filter logic using metadata
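def filter_documents(documents_data, author=None, title_keyword=None, year=None, journal=None):
    """Return the indices of documents whose metadata matches every given criterion.

    documents_data: list of metadata dicts with optional 'author', 'title',
    'year', and 'journal' keys; a document's index is its position in the list.
    String criteria match case-insensitively as substrings; year must match exactly.
    """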
    filtered = []
    for idx, meta in enumerate(documents_data):
        if author and (not meta.get("author") or author.lower() not in meta["author"].lower()):
            continue
        if title_keyword and (not meta.get("title") or title_keyword.lower() not in meta["title"].lower()):
            continue
        if year and (not meta.get("year") or str(year) != str(meta["year"])):
            continue
        if journal and (not meta.get("journal") or journal.lower() not in meta["journal"].lower()):
            continue
        filtered.append(idx)

    # return indices of documents that match the criteria
    return filtered
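
As a usage sketch, the same matching rules can be tried on a few hand-written metadata records. The records below are made up for illustration, and `matches` and `filter_indices` are hypothetical compact helpers that mirror the substring and exact-year checks of `filter_documents`; they are not part of LangChain.

```python
from typing import Dict, List, Optional

# Hypothetical metadata records; real ones would come from `documents[i].metadata`.
sample_metadata: List[Dict] = [
    {"author": "A. Smith, B. Jones", "title": "Vaccine uptake in rural areas",
     "year": 2021, "journal": "Example Journal A"},
    {"author": "C. Lee", "title": "Hospital staffing models",
     "year": "2021", "journal": "Example Journal B"},
    {"author": "", "title": "Vaccine cold chains", "year": 2019},  # no author or journal
]

def matches(meta: Dict, field: str, wanted: Optional[object], exact: bool = False) -> bool:
    """True if no criterion was given, or the field exists and satisfies it."""
    if not wanted:
        return True
    value = meta.get(field)
    if not value:
        return False  # a requested field that is missing or empty never matches
    if exact:
        return str(value) == str(wanted)
    return str(wanted).lower() in str(value).lower()

def filter_indices(records, author=None, title_keyword=None, year=None, journal=None) -> List[int]:
    """Indices of records that satisfy every criterion that was supplied."""
    return [
        idx for idx, meta in enumerate(records)
        if matches(meta, "author", author)
        and matches(meta, "title", title_keyword)
        and matches(meta, "year", year, exact=True)
        and matches(meta, "journal", journal)
    ]

print(filter_indices(sample_metadata, title_keyword="vaccine"))             # [0, 2]
print(filter_indices(sample_metadata, title_keyword="vaccine", year=2021))  # [0]
print(filter_indices(sample_metadata, author="lee", year=2021))             # [1]
```

Note that the year is compared as a string, so `2021` and `"2021"` are treated as equal, and a record with a missing field is excluded whenever that field is part of the query.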


### Assign stable UUIDs to documents and chunks

- For each document: generate a deterministic `doc_id` (UUID v5) from the DOI if available, otherwise from the URL, and otherwise from the source plus a short hash of the first 256 characters of the content.
- For each chunk: generate a deterministic `chunk_id` (UUID v5) using `doc_id`, chunk index, and a short hash of the chunk text.

This ensures IDs are stable across runs if the inputs do not change.
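
The stability claim can be checked directly with the standard library: `uuid.uuid5` derives the UUID from a namespace and a name, so equal inputs always produce the same value. The DOI-style strings below are placeholders, not real identifiers.

```python
import uuid

# Placeholder namespace; any fixed UUID works as long as it never changes.
namespace = uuid.UUID("12345678-1234-5678-1234-567812345678")

first = uuid.uuid5(namespace, "10.0000/example-doi")
second = uuid.uuid5(namespace, "10.0000/example-doi")
other = uuid.uuid5(namespace, "10.0000/another-doi")

print(first == second)  # True: same namespace and name give the same UUID
print(first == other)   # False: a different name gives a different UUID
print(first.version)    # 5
```

Changing the namespace constant changes every derived ID, which is why it should be fixed once per project.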

In [None]:
import uuid, hashlib
from typing import List, Dict

NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")  # project-level constant

def stable_doc_id(meta: Dict, content_preview: str = "") -> str:
    """Create a deterministic UUID for a document.
    Priority: DOI -> URL -> (source + preview hash)
    """
    key = (
        meta.get("doi")
        or meta.get("url")
        or f"{meta.get('source', 'unknown')}::{hashlib.sha1(content_preview.encode('utf-8')).hexdigest()[:12]}"
    )
    return str(uuid.uuid5(NAMESPACE, key))

def stable_chunk_id(doc_id: str, idx: int, text: str) -> str:
    sig = f"{doc_id}:{idx}:{hashlib.sha1(text.encode('utf-8')).hexdigest()[:12]}"
    return str(uuid.uuid5(NAMESPACE, sig))

# Example: assign ids to loaded LangChain documents and their chunks
# Assume `documents` is a list of Document objects from LangChain (with .page_content and .metadata)

doc_records = []
for d in documents:
    preview = d.page_content[:256] if d.page_content else ""
    d.metadata = dict(d.metadata) if d.metadata else {}
    d.metadata["doc_id"] = stable_doc_id(d.metadata, preview)
    doc_records.append({"doc_id": d.metadata["doc_id"], **d.metadata})

print("Assigned document IDs (first 3):", [r["doc_id"] for r in doc_records[:3]])

# If you chunk later, reuse doc_id and generate chunk_id per chunk
# Example chunking (simple):

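def simple_chunk(text: str, size: int = 800, overlap: int = 100) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(len(text), start + size)
        chunks.append(text[start:end])
        if end == len(text):
            break  # last chunk emitted; stepping back by `overlap` here would loop forever
        start = end - overlap
    return chunks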

chunk_records = []
for d in documents:
    doc_id = d.metadata["doc_id"]
    chunks = simple_chunk(d.page_content)
    for i, ch in enumerate(chunks):
        chunk_id = stable_chunk_id(doc_id, i, ch)
        chunk_records.append({
            "doc_id": doc_id,
            "chunk_id": chunk_id,
            "chunk_index": i,
            "text": ch
        })

print("Sample chunk IDs:", [c["chunk_id"] for c in chunk_records[:5]])
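
One reason stable IDs matter downstream: if records are stored keyed by `chunk_id`, re-running ingestion on unchanged input overwrites the same keys instead of creating duplicates. The sketch below uses a plain dict as a stand-in for a vector store; `make_chunk_id`, `ingest`, and `store` are hypothetical names, and the ID scheme mirrors the one described above.

```python
import hashlib
import uuid
from typing import Dict, List

NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")

def make_chunk_id(doc_id: str, idx: int, text: str) -> str:
    """Deterministic chunk ID from the document ID, chunk index, and a text hash."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    return str(uuid.uuid5(NAMESPACE, f"{doc_id}:{idx}:{digest}"))

def ingest(store: Dict[str, dict], doc_id: str, chunks: List[str]) -> None:
    """Upsert chunks into the store, keyed by their deterministic chunk_id."""
    for idx, text in enumerate(chunks):
        store[make_chunk_id(doc_id, idx, text)] = {
            "doc_id": doc_id,
            "chunk_index": idx,
            "text": text,
        }

store: Dict[str, dict] = {}
chunks = ["first chunk of text", "second chunk of text"]

ingest(store, "doc-1", chunks)
ingest(store, "doc-1", chunks)  # second run on identical input
print(len(store))  # 2: the re-run overwrote the existing keys

ingest(store, "doc-1", ["first chunk of text", "second chunk, edited"])
print(len(store))  # 3: only the edited chunk produced a new ID
```

Because the chunk text is part of the ID, an edited chunk gets a new ID while the old record stays in the store, so removing stale chunks has to be handled as a separate step.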