✅ go into a folder (like data/)
✅ find files (pdf, txt, csv, xlsx, docx, json)
✅ read them
✅ convert them into LangChain Document objects
✅ return one big list: documents
========================================================
That list is what you later split into chunks → create embeddings → store in FAISS/Chroma.

1) What is a “Document” in LangChain?

A LangChain Document is basically:

page_content → the text

metadata → extra info (file name, page number, etc.)

Example (conceptually):

Document(
  page_content="Sheikh Mujibur Rahman was ...",
  metadata={"source": "data/Sheikh_Mujibur_Rahman.pdf", "page": 3}
)


For a PDF, you usually get one Document per page.

So if your PDF has 51 pages:
➡️ you get 51 Document objects.
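
To make the page-to-Document mapping concrete, here is a minimal sketch that builds three Documents by hand, the way a page-based PDF loader would. The file name report.pdf and the page texts are made up for illustration, and the small fallback class is only a stand-in so the snippet still runs where LangChain is not installed.

```python
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, used only if LangChain is missing.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

# Hypothetical page texts; a real loader would extract these from the PDF.
page_texts = ["Intro ...", "Chapter 1 ...", "Chapter 2 ..."]

# One Document per page, with a 0-based page number in the metadata.
pages = [
    Document(page_content=text, metadata={"source": "data/report.pdf", "page": i})
    for i, text in enumerate(page_texts)
]

for doc in pages:
    print(doc.metadata["page"], doc.page_content)
```

Three pages in, three Documents out. A 51-page PDF works the same way, just with 51 items in the list.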
==============================================
2) What your function does (high level)
Function name
def load_all_documents(data_dir: str) -> List[Any]:


data_dir is the folder name ("data")

It returns List[Any] (a list of documents).
(Better: List[Document], but beginner-friendly is fine.)

==============================================
3) Path and why you use it
data_path = Path(data_dir).resolve()


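Path("data") is a relative path: “a folder named data”, looked up from wherever you run the script.
.resolve() converts it to the full absolute path, using the current working directory as the starting point.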

Example:

you pass "data"

it becomes something like:
C:\Users\miraa\workspace\RAG\RAG_ONE\data

That’s why you print:

print(f"[DEBUG] Data path: {data_path}")


So you can confirm you’re loading from the correct place.
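
If you want to see the relative vs. absolute difference yourself, this small sketch prints both forms next to the current working directory. It only uses pathlib, and the folder does not need to exist for it to run.

```python
from pathlib import Path

relative = Path("data")        # relative: depends on where the script is run
absolute = relative.resolve()  # anchored at the current working directory

print("cwd:     ", Path.cwd())
print("relative:", relative, "| absolute?", relative.is_absolute())
print("resolved:", absolute, "| absolute?", absolute.is_absolute())
```

If the resolved path points somewhere unexpected, the script was most likely started from a different folder than you assumed.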

4) How it finds files (the most important line)

Example for PDFs:

pdf_files = list(data_path.glob('**/*.pdf'))

What glob('**/*.pdf') means

*.pdf = any file ending with .pdf

**/ = search recursively (inside subfolders too)

So it finds:

data/a.pdf

data/books/history.pdf

data/pdf/Sheikh_Mujibur_Rahman.pdf

Everything.

Same idea for txt/csv/xlsx/docx/json.
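
Here is a small self-contained sketch of the recursive vs. non-recursive difference. It builds a throwaway folder tree in a temp directory (the file names are made up) and runs both patterns against it.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "data"
    (data_path / "books").mkdir(parents=True)
    (data_path / "a.pdf").touch()
    (data_path / "books" / "history.pdf").touch()
    (data_path / "notes.txt").touch()

    # "**/*.pdf" searches data/ and every subfolder below it.
    recursive = sorted(
        p.relative_to(data_path).as_posix() for p in data_path.glob("**/*.pdf")
    )
    # "*.pdf" only looks directly inside data/.
    top_level = sorted(
        p.relative_to(data_path).as_posix() for p in data_path.glob("*.pdf")
    )

print(recursive)  # ['a.pdf', 'books/history.pdf']
print(top_level)  # ['a.pdf']
```

The TXT file is ignored by both patterns, and the PDF in the subfolder is only found by the recursive one.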

5) How it loads each file type
PDF part
loader = PyPDFLoader(str(pdf_file))
loaded = loader.load()
documents.extend(loaded)


What happens:

Create a loader for the PDF

load() reads it

loaded is a list of Documents (often 1 per page)

.extend() adds all of them into your main list

✅ Example:
If PDF = 51 pages
loaded length = 51
documents grows by 51

TXT part
loader = TextLoader(str(txt_file))
loaded = loader.load()


A TXT file becomes one Document total. (If you hit decode errors on Windows, pass encoding="utf-8" to TextLoader.)

✅ Example:
notes.txt → 1 Document

CSV part
loader = CSVLoader(str(csv_file))
loaded = loader.load()


CSVLoader makes 1 Document per row.

Each row's text is stored as "column: value" lines, and the row number goes into metadata["row"].

So if your CSV has 500 data rows, you get 500 Documents.
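
To see what "one Document per row" looks like without installing anything, here is a sketch that approximates that shape with the standard csv module. It is not CSVLoader itself: the plain dicts stand in for Documents, and people.csv and its contents are made up.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be read from a file.
csv_text = "name,city\nAda,London\nLinus,Helsinki\n"

rows_as_docs = []
for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    # Each row becomes its own text block of "column: value" lines.
    content = "\n".join(f"{key}: {value}" for key, value in row.items())
    rows_as_docs.append(
        {"page_content": content, "metadata": {"source": "people.csv", "row": i}}
    )

for doc in rows_as_docs:
    print(doc["metadata"], repr(doc["page_content"]))
```

Two data rows give two entries. The header line is used for the column names and does not become a Document of its own.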

Excel part
loader = UnstructuredExcelLoader(str(xlsx_file))
loaded = loader.load()


Excel is tricky:

It needs the extra unstructured package installed

It tries to extract text from sheets and cells

Output can be messy but works for many cases

If you only have tables, a cleaner choice is often to convert to CSV first.

Word docx part
loader = Docx2txtLoader(str(docx_file))
loaded = loader.load()


Usually 1 Document.

JSON part
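loader = JSONLoader(str(json_file), jq_schema=".", text_content=False)
loaded = loader.load()


⚠️ This is the one that often fails for beginners.

JSONLoader requires a jq_schema argument (and the jq Python package): you have to tell it which part of the JSON to extract text from. A bare JSONLoader(str(json_file)) raises a TypeError. jq_schema="." is the catch-all: it loads the whole file as one Document.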

Example JSON:

[
  {"title": "A", "content": "hello"},
  {"title": "B", "content": "world"}
]


You must tell it which part to use: here, content.

So in real life you do something like (file_path is the path to your JSON file):

JSONLoader(
  file_path,
  jq_schema=".[] | .content",
  text_content=False
)


If your JSON part crashes, it’s usually because of this.
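
If the jq expression looks cryptic, this sketch does the same extraction with plain json so you can see what ".[] | .content" selects. It is an illustration of the expression only, not how JSONLoader is implemented.

```python
import json

raw = '[{"title": "A", "content": "hello"}, {"title": "B", "content": "world"}]'
records = json.loads(raw)

# Plain-Python equivalent of the jq expression ".[] | .content":
#   .[]       -> iterate over every item of the top-level list
#   .content  -> take the "content" field of each item
texts = [record["content"] for record in records]

print(texts)  # ['hello', 'world']
```

Each extracted string would become the page_content of one Document. If your file has a different shape (for example a dict at the top level), the expression has to change to match it.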

6) Why you use try/except everywhere

Example:

try:
    loader = PyPDFLoader(str(pdf_file))
    loaded = loader.load()
    documents.extend(loaded)
except Exception as e:
    print(f"[ERROR] Failed to load PDF {pdf_file}: {e}")


Reason:

One bad file should NOT crash your whole pipeline.

You want it to continue loading other files.

✅ Good practice.
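
A slightly stronger variant of the same idea is to collect the failures in a list instead of only printing them, so you can inspect them after the run. The sketch below assumes a made-up read_text_file helper as a stand-in for a real loader, and load_many is likewise a hypothetical name.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_text_file(path: Path) -> List[str]:
    """Hypothetical stand-in for a loader: one text item per file."""
    return [path.read_text(encoding="utf-8")]


def load_many(paths: List[Path]) -> Tuple[List[str], List[Tuple[Path, str]]]:
    """Load every path, collecting failures instead of stopping at the first one."""
    loaded: List[str] = []
    failures: List[Tuple[Path, str]] = []
    for path in paths:
        try:
            loaded.extend(read_text_file(path))
        except Exception as exc:
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return loaded, failures


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.txt"
    good.write_text("hello", encoding="utf-8")
    missing = Path(tmp) / "missing.txt"  # never created, so loading it fails
    loaded, failures = load_many([missing, good])

print(loaded)
for path, reason in failures:
    print(f"[ERROR] {path.name}: {reason}")
```

The missing file comes first in the list, yet the good file still loads. That is the whole point of wrapping each file separately.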

7) Why print so many debug statements?

Because in ingestion pipelines, beginners get stuck on:

wrong folder path

files not found

one loader failing silently

empty documents

Your debug prints help you see:

how many files found

which file is being loaded

how many docs came from each file

total docs at end

That’s perfect.
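
You can also recompute the per-file counts after loading, straight from the metadata. A minimal sketch, assuming metadata dicts shaped like the ones the loaders attach (the entries below are made up):

```python
from collections import Counter

# Hypothetical metadata, shaped like what the loaders attach to each Document.
metadatas = [
    {"source": "data/history.pdf", "page": 0},
    {"source": "data/history.pdf", "page": 1},
    {"source": "data/notes.txt"},
]

# Count how many Documents came from each source file.
docs_per_file = Counter(m["source"] for m in metadatas)

for source, count in docs_per_file.most_common():
    print(f"{source}: {count} docs")
print("total:", sum(docs_per_file.values()))
```

With real Documents you would build the same Counter from d.metadata["source"] for each d in docs.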

8) What will docs look like in your code?

You do:

docs = load_all_documents("data")
print("Example document:", docs[0])


A sample docs[0] could print like:

page_content='...'

metadata={'source': '...pdf', 'page': 0}

That metadata is SUPER important later for citations.
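
As a small illustration of how that metadata turns into a citation, here is a hypothetical format_citation helper. It assumes "source" holds a file path and "page" is 0-based when present, which matches the sample output above.

```python
from pathlib import Path


def format_citation(metadata: dict) -> str:
    """Build a short human-readable citation from a Document's metadata."""
    name = Path(metadata.get("source", "unknown")).name
    page = metadata.get("page")
    if page is None:
        return name  # file types without pages (txt, csv, ...)
    return f"{name}, p. {page + 1}"  # 0-based page -> 1-based for readers


print(format_citation({"source": "data/mujib.pdf", "page": 0}))  # mujib.pdf, p. 1
print(format_citation({"source": "data/notes.txt"}))             # notes.txt
```

Using .get() means the helper also works for Documents that have no page at all.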

9) Beginner examples to understand the output
Example A: Only 1 TXT file

data/a.txt contains: “Hello world”

Result:

docs length = 1

docs[0].page_content = “Hello world”

docs[0].metadata["source"] might contain the file path

Example B: 2 PDFs (your case)

history.pdf = 51 pages

mujib.pdf = 43 pages

Result:

total docs = 51 + 43 = 94 ✅ (exactly what you saw)

Example C: PDF inside subfolder

If you have:
data/pdf/book.pdf

Your glob finds it because of **/*.pdf

10) Important improvement (so your FAISS results show file/page)

Right now, many loaders store metadata keys like:

"source" not "source_file"

"page" for pdf

To make everything consistent, after loading each doc you can set:

for d in loaded:
    d.metadata["source_file"] = pdf_file.name
    d.metadata["file_type"] = "pdf"


Do that for each file type too.

This way your search results always show:

source_file

file_type

page (for pdf)

Document loader (all file types)

In [None]:
from pathlib import Path
from typing import List, Any
from langchain_community.document_loaders import PyPDFLoader, TextLoader, CSVLoader
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
from langchain_community.document_loaders import JSONLoader

def load_all_documents(data_dir: str) -> List[Any]:
    """
    Load all supported files from the data directory and convert to LangChain document structure.
    Supported: PDF, TXT, CSV, Excel, Word, JSON
    """
    # Use project root data folder
    data_path = Path(data_dir).resolve()
    print(f"[DEBUG] Data path: {data_path}")
    documents = []

    # PDF files
    pdf_files = list(data_path.glob('**/*.pdf'))
    print(f"[DEBUG] Found {len(pdf_files)} PDF files: {[str(f) for f in pdf_files]}")
    for pdf_file in pdf_files:
        print(f"[DEBUG] Loading PDF: {pdf_file}")
        try:
            loader = PyPDFLoader(str(pdf_file))
            loaded = loader.load()
            print(f"[DEBUG] Loaded {len(loaded)} PDF docs from {pdf_file}")
            documents.extend(loaded)
        except Exception as e:
            print(f"[ERROR] Failed to load PDF {pdf_file}: {e}")

    # TXT files
    txt_files = list(data_path.glob('**/*.txt'))
    print(f"[DEBUG] Found {len(txt_files)} TXT files: {[str(f) for f in txt_files]}")
    for txt_file in txt_files:
        print(f"[DEBUG] Loading TXT: {txt_file}")
        try:
            loader = TextLoader(str(txt_file))
            loaded = loader.load()
            print(f"[DEBUG] Loaded {len(loaded)} TXT docs from {txt_file}")
            documents.extend(loaded)
        except Exception as e:
            print(f"[ERROR] Failed to load TXT {txt_file}: {e}")

    # CSV files
    csv_files = list(data_path.glob('**/*.csv'))
    print(f"[DEBUG] Found {len(csv_files)} CSV files: {[str(f) for f in csv_files]}")
    for csv_file in csv_files:
        print(f"[DEBUG] Loading CSV: {csv_file}")
        try:
            loader = CSVLoader(str(csv_file))
            loaded = loader.load()
            print(f"[DEBUG] Loaded {len(loaded)} CSV docs from {csv_file}")
            documents.extend(loaded)
        except Exception as e:
            print(f"[ERROR] Failed to load CSV {csv_file}: {e}")

    # Excel files
    xlsx_files = list(data_path.glob('**/*.xlsx'))
    print(f"[DEBUG] Found {len(xlsx_files)} Excel files: {[str(f) for f in xlsx_files]}")
    for xlsx_file in xlsx_files:
        print(f"[DEBUG] Loading Excel: {xlsx_file}")
        try:
            loader = UnstructuredExcelLoader(str(xlsx_file))
            loaded = loader.load()
            print(f"[DEBUG] Loaded {len(loaded)} Excel docs from {xlsx_file}")
            documents.extend(loaded)
        except Exception as e:
            print(f"[ERROR] Failed to load Excel {xlsx_file}: {e}")

    # Word files
    docx_files = list(data_path.glob('**/*.docx'))
    print(f"[DEBUG] Found {len(docx_files)} Word files: {[str(f) for f in docx_files]}")
    for docx_file in docx_files:
        print(f"[DEBUG] Loading Word: {docx_file}")
        try:
            loader = Docx2txtLoader(str(docx_file))
            loaded = loader.load()
            print(f"[DEBUG] Loaded {len(loaded)} Word docs from {docx_file}")
            documents.extend(loaded)
        except Exception as e:
            print(f"[ERROR] Failed to load Word {docx_file}: {e}")

    # JSON files
    json_files = list(data_path.glob('**/*.json'))
    print(f"[DEBUG] Found {len(json_files)} JSON files: {[str(f) for f in json_files]}")
    for json_file in json_files:
        print(f"[DEBUG] Loading JSON: {json_file}")
        try:
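            # JSONLoader requires jq_schema (and the jq package);
            # "." loads the whole file as one Document.
            loader = JSONLoader(str(json_file), jq_schema=".", text_content=False)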
            loaded = loader.load()
            print(f"[DEBUG] Loaded {len(loaded)} JSON docs from {json_file}")
            documents.extend(loaded)
        except Exception as e:
            print(f"[ERROR] Failed to load JSON {json_file}: {e}")

    print(f"[DEBUG] Total loaded documents: {len(documents)}")
    return documents

# Example usage
if __name__ == "__main__":
    docs = load_all_documents("data")
    print(f"Loaded {len(docs)} documents.")
    print("Example document:", docs[0] if docs else None)

In [None]:
from __future__ import annotations

from pathlib import Path
from typing import List, Optional

from langchain_core.documents import Document

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    CSVLoader,
    Docx2txtLoader,
)
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
from langchain_community.document_loaders import JSONLoader


# ----------------------------
# Helper: standardize metadata
# ----------------------------
def _standardize_metadata(
    docs: List[Document],
    file_path: Path,
    file_type: str,
) -> List[Document]:
    """
    Ensure every Document has consistent metadata keys.
    This is super important for citations in RAG (FAISS/Chroma).
    """
    for d in docs:
        d.metadata = d.metadata or {}

        # Common, consistent keys
        d.metadata["source_path"] = str(file_path)      # full path
        d.metadata["source_file"] = file_path.name      # file name only
        d.metadata["file_type"] = file_type             # pdf/txt/csv/xlsx/docx/json

        # Keep LangChain's original "source" if present; otherwise add it
        # Many loaders set metadata["source"] automatically, but not always.
        d.metadata.setdefault("source", str(file_path))

        # PDF loader sets "page" (0-based). Other file types have no page, so the
        # key is simply left out: some vector stores reject None metadata values.

    return docs


# ----------------------------
# JSON loading (beginner-safe)
# ----------------------------
def load_json_documents(
    json_path: Path,
    # Choose ONE of these patterns depending on your JSON structure
    mode: str = "auto",
) -> List[Document]:
    """
    Load JSON in a safe way.

    JSON is tricky because it depends on structure.
    This function provides beginner-friendly options.

    mode options:
      1) "auto"        -> loads the whole file as ONE Document (jq_schema=".")
      2) "list_content"-> expects a JSON list of objects, each has a "content" field
      3) "dict_text"   -> expects a JSON dict with a "text" field
    """

    try:
        if mode == "auto":
            # NOTE: JSONLoader requires jq_schema (and the jq package).
            # "." selects the whole file; text_content=False lets non-string
            # JSON (lists/dicts) be serialized into page_content.
            loader = JSONLoader(
                file_path=str(json_path),
                jq_schema=".",
                text_content=False,
            )
            return loader.load()

        if mode == "list_content":
            # Example JSON:
            # [
            #   {"title": "A", "content": "hello"},
            #   {"title": "B", "content": "world"}
            # ]
            # jq_schema extracts each content field
            loader = JSONLoader(
                file_path=str(json_path),
                jq_schema=".[] | .content",
                text_content=False,
            )
            return loader.load()

        if mode == "dict_text":
            # Example JSON:
            # {"text": "This is the main text", "source": "x"}
            loader = JSONLoader(
                file_path=str(json_path),
                jq_schema=".text",
                text_content=False,
            )
            return loader.load()

        raise ValueError(f"Unknown JSON mode: {mode}")

    except Exception as e:
        print(f"[ERROR] Failed to load JSON {json_path}: {e}")
        return []


# ----------------------------
# One function to load anything
# ----------------------------
def _load_one_file(file_path: Path, json_mode: str = "auto") -> List[Document]:
    """
    Load a single file into a list of Documents, based on extension.
    """
    suffix = file_path.suffix.lower()

    try:
        if suffix == ".pdf":
            loader = PyPDFLoader(str(file_path))
            docs = loader.load()
            return _standardize_metadata(docs, file_path, "pdf")

        if suffix == ".txt":
            # encoding="utf-8" avoids many Windows encoding issues
            loader = TextLoader(str(file_path), encoding="utf-8")
            docs = loader.load()
            return _standardize_metadata(docs, file_path, "txt")

        if suffix == ".csv":
            loader = CSVLoader(str(file_path))
            docs = loader.load()
            return _standardize_metadata(docs, file_path, "csv")

        if suffix in [".xlsx", ".xls"]:
            loader = UnstructuredExcelLoader(str(file_path))
            docs = loader.load()
            # file_type becomes "xlsx" or "xls", matching the other suffix-based types
            return _standardize_metadata(docs, file_path, suffix.lstrip("."))

        if suffix == ".docx":
            loader = Docx2txtLoader(str(file_path))
            docs = loader.load()
            return _standardize_metadata(docs, file_path, "docx")

        if suffix == ".json":
            docs = load_json_documents(file_path, mode=json_mode)
            return _standardize_metadata(docs, file_path, "json")

        # Unsupported extension
        return []

    except Exception as e:
        print(f"[ERROR] Failed to load {file_path}: {e}")
        return []


# ----------------------------
# Load everything from a folder
# ----------------------------
def load_all_documents(
    data_dir: str,
    recursive: bool = True,
    json_mode: str = "auto",
    allowed_exts: Optional[List[str]] = None,
) -> List[Document]:
    """
    Beginner-friendly "load everything" function.

    Parameters:
      data_dir     : folder name/path (example: "data")
      recursive    : True -> search subfolders too
      json_mode    : "auto" | "list_content" | "dict_text"
      allowed_exts : optional list like [".pdf", ".txt"] to limit file types

    Returns:
      List[Document]
    """
    data_path = Path(data_dir).resolve()
    print(f"[INFO] Data path: {data_path}")

    if not data_path.exists():
        raise FileNotFoundError(f"Data folder not found: {data_path}")

    if allowed_exts is None:
        allowed_exts = [".pdf", ".txt", ".csv", ".xlsx", ".xls", ".docx", ".json"]
    allowed_exts = [e.lower() for e in allowed_exts]

    pattern = "**/*" if recursive else "*"
    # sorted() keeps the load order stable between runs
    all_files = sorted(
        p for p in data_path.glob(pattern)
        if p.is_file() and p.suffix.lower() in allowed_exts
    )

    print(f"[INFO] Found {len(all_files)} supported files")
    for p in all_files:
        print(f"  - {p.name}")

    documents: List[Document] = []

    for f in all_files:
        docs = _load_one_file(f, json_mode=json_mode)
        if docs:
            print(f"[INFO] Loaded {len(docs)} docs from {f.name}")
            documents.extend(docs)
        else:
            print(f"[WARN] No docs loaded from {f.name}")

    print(f"[INFO] Total loaded documents: {len(documents)}")
    return documents


# ----------------------------
# Quick test (run this file directly)
# ----------------------------
if __name__ == "__main__":
    docs = load_all_documents("data", json_mode="auto")
    print(f"\nLoaded {len(docs)} documents.")

    if docs:
        d0 = docs[0]
        print("\n--- Example Document ---")
        print("Text preview:", d0.page_content[:300])
        print("Metadata:", d0.metadata)
