# LangChain Notebook (Part 1): Readers + `Document` object + Cleaning (in depth)

This notebook covers **only** two topics:
1. **Readers / Loaders** → how data becomes LangChain `Document` objects (with metadata)
2. **Cleaning** → practical, production-style text cleanup **while preserving metadata**

You can run this notebook end-to-end on your machine.

## 0) Install & Imports

> If you already have LangChain installed, you can skip the install cell.

We'll use:
- `langchain-core` for the `Document`
- `langchain-community` for common loaders like `TextLoader`, `DirectoryLoader`, `PyPDFLoader`

In [1]:
# If needed (run once)
# %pip install -U langchain langchain-core langchain-community pypdf unstructured

import os
import re
import unicodedata
from pathlib import Path
from typing import List, Dict, Any, Iterable, Callable

from langchain_core.documents import Document

# Loaders (Readers)
from langchain_community.document_loaders import TextLoader, DirectoryLoader

# Optional: PDF loader (needs pypdf)
try:
    from langchain_community.document_loaders import PyPDFLoader
except Exception:
    PyPDFLoader = None

print("Ready ✅")

Ready ✅


## 1) What is a LangChain `Document`?

A `Document` is the **standard container** LangChain uses across loaders, splitters, vector DBs, retrievers, etc.

It has two main fields:
- `page_content`: the text content
- `metadata`: a dictionary (source, file path, page number, etc.)

**Why metadata matters (production):**
- You need to show citations ("source: file.pdf page 3")
- You need filtering ("only docs from customer=A")
- You need traceability & debugging
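
As a rough illustration of the first two points, the sketch below works on plain metadata dictionaries. The helper names `format_citation` and `filter_by`, the `customer` key, and the assumption that `page` is a zero-based index are illustrative choices, not part of LangChain.

```python
from typing import Any, Dict, List


def format_citation(metadata: Dict[str, Any]) -> str:
    """Build a human-readable citation from a metadata dict (hypothetical helper)."""
    source = metadata.get("source", "unknown")
    page = metadata.get("page")
    if page is None:
        return f"source: {source}"
    # Assumption: `page` is a zero-based index, so add 1 for display.
    return f"source: {source} page {page + 1}"


def filter_by(metadatas: List[Dict[str, Any]], key: str, value: Any) -> List[Dict[str, Any]]:
    """Keep only the metadata dicts whose `key` equals `value` (hypothetical helper)."""
    return [m for m in metadatas if m.get(key) == value]


metadatas = [
    {"source": "file.pdf", "page": 2, "customer": "A"},
    {"source": "notes.txt", "customer": "B"},
]

print(format_citation(metadatas[0]))
print([m["source"] for m in filter_by(metadatas, "customer", "A")])
```

Both helpers only read the dictionary, which is why keeping metadata intact through every later step matters.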

In [2]:
doc = Document(
    page_content="Hello! This is a sample document.",
    metadata={"source": "manual", "topic": "demo"}
)

doc, doc.page_content, doc.metadata

(Document(metadata={'source': 'manual', 'topic': 'demo'}, page_content='Hello! This is a sample document.'),
 'Hello! This is a sample document.',
 {'source': 'manual', 'topic': 'demo'})

## 2) Create a small local dataset (for demo)

We'll create a mini folder with a few `.txt` files containing:
- extra whitespace
- weird unicode characters
- repeated “header/footer” like PDF exports
- some boilerplate lines

So we can practice **realistic cleaning**.

In [3]:
DATA_DIR = Path("demo_docs")
DATA_DIR.mkdir(exist_ok=True)

(DATA_DIR / "doc1.txt").write_text(
"""ACME SUPPORT PORTAL — INTERNAL USE ONLY
------------------------------------------

Hello   team,

This   is    a    test document.  

It contains    extra spaces,    odd line breaks,
and some unicode like café, naïve, and “smart quotes”.

ACME SUPPORT PORTAL — INTERNAL USE ONLY
Page 1
""", 
encoding="utf-8"
)

(DATA_DIR / "doc2.txt").write_text(
"""ACME SUPPORT PORTAL — INTERNAL USE ONLY
------------------------------------------

FAQ:
1) Reset password — go to Settings → Security.
2) Contact support at support@example.com

Disclaimer: This email and any attachments are confidential.
Disclaimer: This email and any attachments are confidential.

ACME SUPPORT PORTAL — INTERNAL USE ONLY
Page 2
""", 
encoding="utf-8"
)

(DATA_DIR / "doc3.txt").write_text(
"""Report Title: Quarterly Summary

\t\tThis line starts with tabs.
\n\n\nMultiple blank lines above.

• Bullet 1
• Bullet 2

Footer: Company Confidential
Footer: Company Confidential
Footer: Company Confidential
""", 
encoding="utf-8"
)

print("Created files:", [p.name for p in DATA_DIR.glob("*.txt")])

Created files: ['doc1.txt', 'doc3.txt', 'doc2.txt']


## 3) Reader 1: `TextLoader` (single file)

`TextLoader` reads one file and returns a list of `Document` objects; for a plain-text file that list holds a single document.

Inspect the metadata: the file path is stored under `source`.

In [4]:
single_docs = TextLoader(str(DATA_DIR / "doc1.txt"), encoding="utf-8").load()
len(single_docs), single_docs[0].metadata, single_docs[0].page_content[:200]

(1,
 {'source': 'demo_docs/doc1.txt'},
 'ACME SUPPORT PORTAL — INTERNAL USE ONLY\n------------------------------------------\n\nHello   team,\n\nThis   is    a    test document.  \n\nIt contains    extra spaces,    odd line breaks,\nand some unicode')

## 4) Reader 2: `DirectoryLoader` (many files)

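`DirectoryLoader` ingests a whole folder.
- `glob="**/*.txt"` picks matching files recursively
- `loader_cls=TextLoader` tells it how to load each file

Files come back in filesystem order, not sorted (in the output below `doc3.txt` comes before `doc2.txt`). When the order matters, select documents by `metadata["source"]` rather than by list position.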
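
To preview which files a pattern such as `**/*.txt` would pick up, and to get them in a stable order, a standard-library sketch like the following can help. It is not `DirectoryLoader` itself; `list_matching_files` is a hypothetical helper and the folder layout is made up for the demo.

```python
import tempfile
from pathlib import Path
from typing import List


def list_matching_files(root: Path, pattern: str = "**/*.txt") -> List[Path]:
    """Return files under `root` that match `pattern`, sorted for a stable order."""
    return sorted(p for p in root.glob(pattern) if p.is_file())


# Demo on a throwaway folder so the sketch does not touch real data.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "b.txt").write_text("b", encoding="utf-8")
    (root / "a.txt").write_text("a", encoding="utf-8")
    (root / "sub" / "c.txt").write_text("c", encoding="utf-8")
    (root / "skip.md").write_text("not matched", encoding="utf-8")

    found = [p.relative_to(root).as_posix() for p in list_matching_files(root)]
    print(found)
```

The `**` segment matches zero or more directory levels, so files directly in the root folder are included as well.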

In [5]:
dir_loader = DirectoryLoader(
    str(DATA_DIR),
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"}
)

docs: List[Document] = dir_loader.load()
print("Documents loaded:", len(docs))

# Show a quick summary
for i, d in enumerate(docs, start=1):
    print(f"--- Doc {i} ---")
    print("source:", d.metadata.get("source"))
    print("chars:", len(d.page_content))
    print("preview:", repr(d.page_content[:120]))

Documents loaded: 3
--- Doc 1 ---
source: demo_docs/doc1.txt
chars: 287
preview: 'ACME SUPPORT PORTAL — INTERNAL USE ONLY\n------------------------------------------\n\nHello   team,\n\nThis   is    a    tes'
--- Doc 2 ---
source: demo_docs/doc3.txt
chars: 205
preview: 'Report Title: Quarterly Summary\n\n\t\tThis line starts with tabs.\n\n\n\nMultiple blank lines above.\n\n• Bullet 1\n• Bullet 2\n\nFo'
--- Doc 3 ---
source: demo_docs/doc2.txt
chars: 349
preview: 'ACME SUPPORT PORTAL — INTERNAL USE ONLY\n------------------------------------------\n\nFAQ:\n1) Reset password — go to Setti'


## 5) Optional Reader: `PyPDFLoader` (PDF → per-page Documents)

If you load a PDF, you usually get **one Document per page**, with metadata like:
- `source` (file path)
- `page` (zero-based page index, so the first page is `0`)

> This section is optional; you can run it if you have a PDF file locally.

Example usage:

```python
pdf_docs = PyPDFLoader("myfile.pdf").load()
pdf_docs[0].metadata  # includes page
```

In [None]:
# OPTIONAL: Only run if you have a PDF file.
# Put a pdf path below and uncomment.

# PDF_PATH = "sample.pdf"
# if PyPDFLoader is None:
#     print("PyPDFLoader not available. Install pypdf: %pip install -U pypdf")
# else:
#     pdf_docs = PyPDFLoader(PDF_PATH).load()
#     print("PDF pages loaded:", len(pdf_docs))
#     print("First page metadata:", pdf_docs[0].metadata)
#     print(pdf_docs[0].page_content[:300])

# Cleaning (in depth)

Cleaning is not “one perfect function.” In production, you build a **pipeline**:
1. Normalize unicode (compatibility characters, non-breaking spaces)
2. Standardize whitespace
3. Remove boilerplate (disclaimers, repeated banners)
4. Remove repeated headers/footers (common in PDFs)
5. Deduplicate repeated lines
6. Keep metadata + add cleaning metadata for traceability

We'll implement each step as a **small function** and compose them.
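
One way to express "small functions, composed" is a generic helper that chains text-to-text steps. This is a minimal sketch under that assumption; `compose`, `strip_edges`, and `collapse_spaces` are hypothetical names and the two toy steps only stand in for real cleaning functions.

```python
import re
from functools import reduce
from typing import Callable

TextStep = Callable[[str], str]


def compose(*steps: TextStep) -> TextStep:
    """Return a function that applies `steps` from left to right."""
    def pipeline(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return pipeline


# Two toy steps standing in for real cleaning functions.
def strip_edges(text: str) -> str:
    return text.strip()


def collapse_spaces(text: str) -> str:
    return re.sub(r" {2,}", " ", text)


clean = compose(strip_edges, collapse_spaces)
print(repr(clean("  Hello   team  ")))
```

Because every step has the same signature, steps can be reordered, removed, or tested in isolation.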

## 6) Step A — Unicode normalization

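Why:
- Different sources contain ligatures, full-width characters, non-breaking spaces, and other compatibility forms of the same character
- Normalization makes text consistent for embeddings and retrieval

We’ll use:
- `unicodedata.normalize("NFKC", text)`
- Replace non-breaking spaces with normal spaces (NFKC already does this for `U+00A0`; the explicit replace is a safeguard)

Note: NFKC leaves curly quotes (`U+201C`, `U+201D`) and the em dash (`U+2014`) unchanged, which is why the smart quotes survive in the demo output. Map them explicitly if you need ASCII punctuation.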
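
The sketch below shows what NFKC does and does not change, then adds an optional mapping for typographic punctuation. `PUNCT_MAP` and `to_ascii_punctuation` are hypothetical additions; whether to flatten quotes and dashes at all depends on your corpus.

```python
import unicodedata

# Characters that NFKC folds into plain equivalents.
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"    # "fi" ligature
assert unicodedata.normalize("NFKC", "\uff21\uff22") == "AB"  # full-width letters
assert unicodedata.normalize("NFKC", "a\u00a0b") == "a b"     # non-breaking space

# Characters that NFKC leaves alone.
assert unicodedata.normalize("NFKC", "\u201cquote\u201d") == "\u201cquote\u201d"
assert unicodedata.normalize("NFKC", "a \u2014 b") == "a \u2014 b"

# Optional, hypothetical mapping from typographic to ASCII punctuation.
PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u2013": "-", "\u2014": "-",  # en dash, em dash
})


def to_ascii_punctuation(text: str) -> str:
    """Apply NFKC, then replace curly quotes and dashes with ASCII characters."""
    return unicodedata.normalize("NFKC", text).translate(PUNCT_MAP)


print(to_ascii_punctuation("caf\u00e9 \u201csmart quotes\u201d \u2014 done"))
```

Accented letters such as `é` are kept; only the punctuation listed in the mapping is rewritten.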

In [6]:
def normalize_unicode(text: str) -> str:
    # NFKC: canonicalize compatibility characters (common for scraped/converted text)
    text = unicodedata.normalize("NFKC", text)
    # Convert non-breaking space to normal space
    text = text.replace("\u00A0", " ")
    return text

sample = 'café “smart quotes”\u00A0with\u00A0nbsp'
print("BEFORE:", sample)
print("AFTER :", normalize_unicode(sample))

BEFORE: café “smart quotes” with nbsp
AFTER : café “smart quotes” with nbsp


## 7) Step B — Whitespace normalization

Why:
- Runs of spaces, tab indents, and stacked blank lines add characters that carry no meaning and can dilute embeddings
- Uniform whitespace makes chunk sizes more predictable when you split later

Typical operations:
- Convert tabs → single space
- Collapse multiple spaces
- Collapse multiple blank lines
- Strip trailing spaces
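
A variation on these operations, for text where indentation carries no meaning, also trims leading whitespace on every line. This is a sketch, and `normalize_whitespace_strict` is a hypothetical name; keep indentation if your documents contain code or nested lists.

```python
import re


def normalize_whitespace_strict(text: str) -> str:
    """Whitespace cleanup that also drops leading indentation (hypothetical variant)."""
    cleaned_lines = []
    for line in text.splitlines():
        # Collapse runs of spaces and tabs, then trim both ends of the line.
        cleaned_lines.append(re.sub(r"[ \t]+", " ", line).strip())
    text = "\n".join(cleaned_lines)
    # Collapse 3+ newlines to 2 so paragraph breaks survive.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


messy = "Title\n\n\t\tIndented   line.  \n\n\n\nNext paragraph.\n"
print(repr(normalize_whitespace_strict(messy)))
```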

In [7]:
def normalize_whitespace(text: str) -> str:
    text = text.replace("\t", " ")
    # Remove trailing spaces on each line
    text = "\n".join(line.rstrip() for line in text.splitlines())
    # Collapse multiple spaces
    text = re.sub(r"[ ]{2,}", " ", text)
    # Collapse 3+ newlines to just 2 (keeps paragraph breaks)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

messy = "Hello\t\tworld!   This   has   spaces.\n\n\nAnd many blanks.   \n"
print("BEFORE:", repr(messy))
print("AFTER :", repr(normalize_whitespace(messy)))

BEFORE: 'Hello\t\tworld!   This   has   spaces.\n\n\nAnd many blanks.   \n'
AFTER : 'Hello world! This has spaces.\n\nAnd many blanks.'


## 8) Step C — Remove known boilerplate patterns

In email-like or corporate docs, you’ll see repeated boilerplate:
- legal disclaimers
- banners like "INTERNAL USE ONLY"
- signatures

Approach:
- Keep a list of regex patterns (case-insensitive)
- Remove matching lines

> Important: Over-aggressive boilerplate removal can delete useful information, so keep the rules explicit and versioned.
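
One possible way to keep rules explicit and versioned is to name each rule and record which rule removed which line. The sketch below assumes that approach; `Rule`, `RULESET_VERSION`, `RULES`, and `remove_with_audit` are hypothetical names, and the version label is a placeholder.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence, Tuple

RULESET_VERSION = "rules-v1"  # hypothetical label; bump it whenever a rule changes


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern


RULES = (
    Rule("disclaimer", re.compile(r"^Disclaimer:.*confidential\.?$", re.IGNORECASE)),
    Rule("separator", re.compile(r"^-{5,}$")),
)


def remove_with_audit(text: str, rules: Sequence[Rule] = RULES) -> Tuple[str, List[Tuple[str, str]]]:
    """Drop lines matching any rule and report (rule name, removed line) pairs."""
    kept: List[str] = []
    removed: List[Tuple[str, str]] = []
    for line in text.splitlines():
        stripped = line.strip()
        hit = next((r for r in rules if r.pattern.match(stripped)), None)
        if hit is None:
            kept.append(line)
        else:
            removed.append((hit.name, stripped))
    return "\n".join(kept), removed


sample = "FAQ:\n----------\nDisclaimer: This email is confidential.\nReset your password."
cleaned, audit = remove_with_audit(sample)
print(cleaned)
print(audit)
```

The audit list makes it easy to review what a rule change would delete before running it over a whole corpus.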

In [8]:
BOILERPLATE_PATTERNS = [
    r"^Disclaimer:.*confidential\.?$",
    r"^ACME SUPPORT PORTAL — INTERNAL USE ONLY$",
    r"^-{5,}$",  # long separator lines
]

compiled_bp = [re.compile(p, flags=re.IGNORECASE) for p in BOILERPLATE_PATTERNS]

def remove_boilerplate_lines(text: str, patterns=compiled_bp) -> str:
    kept_lines = []
    for line in text.splitlines():
        if any(p.match(line.strip()) for p in patterns):
            continue
        kept_lines.append(line)
    return "\n".join(kept_lines)

# Pick doc2.txt by source: it contains the banner, separator, and disclaimer lines.
# (Selecting by position is fragile because the load order is not sorted.)
demo = next(d for d in docs if d.metadata["source"].endswith("doc2.txt")).page_content
print("ORIGINAL (snippet):")
print(demo[:220])
print("\nCLEANED (snippet):")
print(remove_boilerplate_lines(demo)[:220])

ORIGINAL (snippet):
ACME SUPPORT PORTAL — INTERNAL USE ONLY
------------------------------------------

FAQ:
1) Reset password — go to Settings → Security.
2) Contact support at support@example.com

Disclaimer: This email and any attachment

CLEANED (snippet):

FAQ:
1) Reset password — go to Settings → Security.
2) Contact support at support@example.com


Page 2


## 9) Step D — Remove repeated headers/footers (heuristic)

This is super common in PDFs:
- Each page repeats the same header & footer
- When you load pages, those lines appear again and again

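Heuristic strategy:
1. Split into lines
2. Count how often each non-trivial line occurs (here within a single document; for PDFs, count across pages)
3. Remove lines whose share of all non-empty lines reaches a threshold

Caveat: the threshold is a ratio, so in a very short text a line that occurs only once can reach it (1 of 4 lines is already 0.25). Tune `min_line_len` and `freq_threshold` per corpus and check the results.

We’ll demonstrate a simple version that works well for many cases.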
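
For the across-pages case, a sketch along these lines counts on how many pages a line appears and requires at least two pages before treating it as a header or footer. The function names, the `min_page_ratio` and `min_pages` parameters, and the sample pages are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Set


def find_repeated_across_pages(pages: List[str], min_line_len: int = 10,
                               min_page_ratio: float = 0.6, min_pages: int = 2) -> Set[str]:
    """Return lines that occur on a large share of pages (likely headers/footers)."""
    page_hits: Counter = Counter()
    for page in pages:
        # Use a set so a line counts once per page, however often it repeats there.
        unique_lines = {ln.strip() for ln in page.splitlines() if len(ln.strip()) >= min_line_len}
        page_hits.update(unique_lines)
    return {
        ln for ln, n in page_hits.items()
        if n >= min_pages and n / len(pages) >= min_page_ratio
    }


def strip_lines(page: str, to_remove: Set[str]) -> str:
    """Remove every line whose stripped form is in `to_remove`."""
    return "\n".join(ln for ln in page.splitlines() if ln.strip() not in to_remove)


pages = [
    "ACME Quarterly Report\nRevenue grew in the first quarter.",
    "ACME Quarterly Report\nCosts stayed flat year over year.",
    "ACME Quarterly Report\nOutlook for next year is stable.",
]
repeated = find_repeated_across_pages(pages)
print(repeated)
print([strip_lines(p, repeated) for p in pages])
```

The `min_pages` guard means a line that occurs only once is never removed, no matter how few pages there are.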

In [9]:
from collections import Counter

def remove_repeated_lines(text: str, min_line_len: int = 10, freq_threshold: float = 0.30) -> str:
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return text

    # Count line frequencies
    counts = Counter(ln for ln in lines if len(ln) >= min_line_len)
    total = len(lines)
    # Identify lines repeated too frequently
    repeated = {ln for ln, c in counts.items() if c / total >= freq_threshold}

    # Filter them out
    new_lines = []
    for ln in text.splitlines():
        s = ln.strip()
        if s and s in repeated:
            continue
        new_lines.append(ln)
    return "\n".join(new_lines).strip()

text_with_repeats = docs[2].page_content
print("BEFORE:\n", text_with_repeats)
print("\nAFTER:\n", remove_repeated_lines(text_with_repeats, min_line_len=8, freq_threshold=0.20))

BEFORE:
 ACME SUPPORT PORTAL — INTERNAL USE ONLY
------------------------------------------

FAQ:
1) Reset password — go to Settings → Security.
2) Contact support at support@example.com

Disclaimer: This email and any attachments are confidential.
Disclaimer: This email and any attachments are confidential.

ACME SUPPORT PORTAL — INTERNAL USE ONLY
Page 2


AFTER:
 ------------------------------------------

FAQ:
1) Reset password — go to Settings → Security.
2) Contact support at support@example.com


Page 2


## 10) Step E — Deduplicate consecutive duplicate lines

Sometimes exports repeat the same footer line multiple times consecutively.

We'll remove **consecutive duplicates** while keeping paragraphs intact.

In [10]:
def dedupe_consecutive_lines(text: str) -> str:
    out = []
    prev = None
    for ln in text.splitlines():
        if prev is not None and ln.strip() and ln.strip() == prev.strip():
            continue
        out.append(ln)
        prev = ln
    return "\n".join(out).strip()

x = "Footer: Company Confidential\nFooter: Company Confidential\nFooter: Company Confidential\n\nHello\nHello\nWorld"
print("BEFORE:", x)
print("AFTER :", dedupe_consecutive_lines(x))

BEFORE: Footer: Company Confidential
Footer: Company Confidential
Footer: Company Confidential

Hello
Hello
World
AFTER : Footer: Company Confidential

Hello
World


## 11) Build a cleaning pipeline (compose steps)

We’ll build:
- `clean_text(text)` → returns cleaned text
- `clean_documents(docs)` → returns new `Document` objects with cleaned text, the original metadata, and added cleaning fields

**Production tip:** add cleaning metadata so you can debug later, for example:
- `cleaning_version`
- `original_char_count`
- `clean_char_count`
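
A small helper can build these fields in one place. In this sketch `build_cleaning_metadata` is a hypothetical name, and `kept_ratio` and `clean_sha256` are optional extras that go beyond the three fields listed above.

```python
import hashlib
from typing import Any, Dict


def build_cleaning_metadata(original: str, cleaned: str, version: str) -> Dict[str, Any]:
    """Return traceability fields describing one cleaning run (hypothetical helper)."""
    original_len = len(original)
    return {
        "cleaning_version": version,
        "original_char_count": original_len,
        "clean_char_count": len(cleaned),
        # Hypothetical extras: share of characters kept, and a hash of the cleaned text.
        "kept_ratio": round(len(cleaned) / original_len, 3) if original_len else 0.0,
        "clean_sha256": hashlib.sha256(cleaned.encode("utf-8")).hexdigest(),
    }


meta = {"source": "demo_docs/doc1.txt"}
meta.update(build_cleaning_metadata("Hello   team,  ", "Hello team,", version="v1.0"))
print(meta)
```

Updating a copy of the loader's metadata keeps `source` and any other original keys intact.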

In [11]:
CLEANING_VERSION = "v1.0"

def clean_text(text: str) -> str:
    text = normalize_unicode(text)
    text = remove_boilerplate_lines(text)
    text = dedupe_consecutive_lines(text)
    text = normalize_whitespace(text)
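    # Optional repeated-line removal (tune for your corpus).
    # Caution: once a text has 4 or fewer non-empty lines, a line that occurs only once
    # reaches 1/4 = 0.25 and is dropped. In this demo that strips both FAQ entries from doc2.txt.
    text = remove_repeated_lines(text, min_line_len=10, freq_threshold=0.25)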
    text = normalize_whitespace(text)
    return text

def clean_documents(documents: List[Document]) -> List[Document]:
    cleaned_docs = []
    for d in documents:
        original = d.page_content
        cleaned = clean_text(original)

        new_meta = dict(d.metadata)
        new_meta.update({
            "cleaning_version": CLEANING_VERSION,
            "original_char_count": len(original),
            "clean_char_count": len(cleaned),
        })
        cleaned_docs.append(Document(page_content=cleaned, metadata=new_meta))
    return cleaned_docs

cleaned_docs = clean_documents(docs)

# Show diff-like preview for one doc
i = 0
print("SOURCE:", cleaned_docs[i].metadata.get("source"))
print("\n--- BEFORE (first 250 chars) ---\n", docs[i].page_content[:250])
print("\n--- AFTER  (first 250 chars) ---\n", cleaned_docs[i].page_content[:250])
print("\nMETADATA:", cleaned_docs[i].metadata)

SOURCE: demo_docs/doc1.txt

--- BEFORE (first 250 chars) ---
 ACME SUPPORT PORTAL — INTERNAL USE ONLY
------------------------------------------

Hello   team,

This   is    a    test document.  

It contains    extra spaces,    odd line breaks,
and some unicode like café, naïve, and “smart quotes”.

ACME SUPPO

--- AFTER  (first 250 chars) ---
 Hello team,

This is a test document.

It contains extra spaces, odd line breaks,
and some unicode like café, naïve, and “smart quotes”.

Page 1

METADATA: {'source': 'demo_docs/doc1.txt', 'cleaning_version': 'v1.0', 'original_char_count': 287, 'clean_char_count': 144}


## 12) Quick quality checks (recommended)

Before you proceed to chunking/embeddings (next notebook sections), do checks like:
- How many docs are empty after cleaning?
- How much did the average size drop?
- Did any doc become suspiciously small?
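
The last check can be automated from the `original_char_count` and `clean_char_count` metadata fields. The sketch below is one way to do it; `flag_suspicious` and both thresholds are assumptions, and the sample values are illustrative rather than taken from a real run.

```python
from typing import Any, Dict, List


def flag_suspicious(metadatas: List[Dict[str, Any]], min_chars: int = 50,
                    min_kept_ratio: float = 0.2) -> List[str]:
    """Return sources whose cleaned text is very short or lost most of its characters."""
    flagged = []
    for m in metadatas:
        original = m.get("original_char_count", 0)
        clean = m.get("clean_char_count", 0)
        kept_ratio = clean / original if original else 0.0
        if clean < min_chars or kept_ratio < min_kept_ratio:
            flagged.append(m.get("source", "unknown"))
    return flagged


# Illustrative values, not taken from a real run.
sample = [
    {"source": "a.txt", "original_char_count": 300, "clean_char_count": 150},
    {"source": "b.txt", "original_char_count": 350, "clean_char_count": 12},
    {"source": "c.txt", "original_char_count": 0, "clean_char_count": 0},
]
print(flag_suspicious(sample))
```

Flagged sources are candidates for manual review, since a large drop often means a cleaning rule removed real content.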

In [12]:
def stats(documents: List[Document]) -> Dict[str, Any]:
    lengths = [len(d.page_content) for d in documents]
    return {
        "count": len(documents),
        "empty_count": sum(1 for n in lengths if n == 0),
        "min_chars": min(lengths) if lengths else 0,
        "max_chars": max(lengths) if lengths else 0,
        "avg_chars": sum(lengths) / len(lengths) if lengths else 0,
    }

print("RAW STATS   :", stats(docs))
print("CLEAN STATS :", stats(cleaned_docs))

RAW STATS   : {'count': 3, 'empty_count': 0, 'min_chars': 205, 'max_chars': 349, 'avg_chars': 280.3333333333333}
CLEAN STATS : {'count': 3, 'empty_count': 0, 'min_chars': 12, 'max_chars': 144, 'avg_chars': 99.66666666666667}


## 13) What you achieved (and what comes next)

✅ You now have:
- A folder ingested into LangChain `Document` objects
- Metadata preserved for tracing and filtering
- A practical cleaning pipeline you can reuse in real projects

Next steps (when you’re ready):
- Chunking / splitting (token-aware)
- Embeddings + Vector DB insertion
- Retrieval + rerank