---------------------
#### Chunking methods
--------------------


#### 1. By Pages
- **Description**: Each page of the PDF is treated as a separate chunk.
- **Advantages**:
  - Simple and intuitive.
  - Keeps each page's layout elements (e.g., tables, figures, and headers) together in one chunk.
- **Disadvantages**:
  - Page length varies, so chunk sizes are uneven, and sentences, paragraphs, or tables that span a page break are split.
- **Use Case**: Documents with well-structured layouts (e.g., reports, slide decks).
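
As a minimal sketch of page-level chunking, the snippet below assumes the text has already been extracted by a tool that separates pages with a form-feed character (`\f`), as `pdftotext` does. The function name `chunk_by_pages` is hypothetical.

```python
def chunk_by_pages(text: str) -> list[dict]:
    # Assumption: pages are separated by a form feed character.
    # Adjust the separator if your extraction tool uses another convention.
    chunks = []
    for page_number, page_text in enumerate(text.split("\f"), start=1):
        page_text = page_text.strip()
        if page_text:  # skip blank pages but keep the original page numbers
            chunks.append({"page": page_number, "text": page_text})
    return chunks
```

Keeping the page number alongside the text makes it possible to cite the source page when a chunk is retrieved later.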

#### 2. By Paragraphs
- **Description**: Use paragraph delimiters (e.g., `\n\n` or indentation) to split text.
- **Advantages**:
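  - Keeps each chunk focused on a single idea, since a paragraph usually develops one point.
  - Works well for prose-heavy documents like books or articles.
- **Disadvantages**:
  - Some PDFs lack explicit paragraph markers (extracted text often has a line break after every visual line), so preprocessing is required.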
- **Use Case**: Articles, white papers, legal documents.
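
A small sketch of paragraph splitting, assuming the extracted text marks paragraph boundaries with blank lines. The name `chunk_by_paragraphs` is hypothetical; text without blank-line markers would need a different heuristic (for example, indentation).

```python
import re

def chunk_by_paragraphs(text: str) -> list[str]:
    # Normalize Windows line endings so the blank-line pattern is uniform.
    text = text.replace("\r\n", "\n")
    # A paragraph ends at one or more blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse hard-wrapped lines inside each paragraph into single spaces.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]
```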

#### 3. By Headings or Sections
- **Description**: Split text based on headings, subheadings, or section titles detected via styles or keywords.
- **Advantages**:
  - Logical organization of content.
  - Preserves context for each section.
- **Disadvantages**:
  - Requires robust heading detection, which might be complex in poorly formatted PDFs.
- **Use Case**: Technical manuals, research papers.
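
The sketch below illustrates keyword-style heading detection on plain text. It assumes headings are numbered lines such as `1. Introduction` or `2.1 Methods`; the pattern `HEADING_RE` and the function names are hypothetical, and an ordinary line that happens to start with a number followed by a capitalized word would be misdetected as a heading.

```python
import re

# Hypothetical pattern: numbered headings like "1. Introduction" or "2.1 Methods".
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?\s+[A-Z].*$")

def _append_section(sections: list, heading, body: list) -> None:
    content = "\n".join(body).strip()
    if heading is not None or content:
        sections.append({"heading": heading, "text": content})

def chunk_by_headings(text: str) -> list[dict]:
    sections = []
    heading, body = None, []  # text before the first heading has heading=None
    for line in text.splitlines():
        stripped = line.strip()
        if HEADING_RE.match(stripped):
            _append_section(sections, heading, body)  # close the previous section
            heading, body = stripped, []
        else:
            body.append(line)
    _append_section(sections, heading, body)  # close the final section
    return sections
```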


#### 4. Fixed Token Count
- **Description**: Divide text into chunks of a fixed number of tokens (e.g., 512 tokens), sized to fit the target model's input limit.
- **Advantages**:
  - Chunk sizes are predictable, and every chunk is guaranteed to fit the embedding model's or LLM's input limit.
- **Disadvantages**:
  - May split sentences or paragraphs awkwardly, losing semantic coherence.
- **Use Case**: Generic documents for embedding or retrieval.
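
A minimal sketch of fixed-size token chunking. To stay dependency-free it treats whitespace-separated words as tokens, which is an assumption: a real model tokenizer usually produces more tokens than words, so in practice the split should use the tokenizer of the target model. The name `chunk_by_tokens` is hypothetical.

```python
def chunk_by_tokens(text: str, max_tokens: int = 512) -> list[str]:
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Stand-in for a real tokenizer: split on whitespace.
    tokens = text.split()
    # Take consecutive, non-overlapping slices of at most max_tokens tokens.
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]
```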

#### 5. Fixed Character Count
- **Description**: Split text into chunks of a fixed number of characters (e.g., 1000 characters).
- **Advantages**:
  - Simple to implement and suitable for token-agnostic systems.
- **Disadvantages**:
  - Splits may occur mid-sentence, losing meaning.
- **Use Case**: Quick prototyping or when tokenization is unavailable.
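
Character-based splitting needs nothing beyond string slicing. The sketch below uses a hypothetical function name, `chunk_by_characters`, and makes no attempt to respect word or sentence boundaries.

```python
def chunk_by_characters(text: str, chunk_size: int = 1000) -> list[str]:
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Consecutive slices of chunk_size characters; the last one may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Joining the chunks back together reproduces the original text exactly, which is a quick sanity check for gaps or duplication.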

#### 6. Semantic Chunking
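- **Description**: Use natural language processing (NLP) techniques to place chunk boundaries where the topic shifts (e.g., using BERT, spaCy, or sentence transformers).
- **Advantages**:
  - Boundaries follow meaning rather than layout, so each chunk retains its context.
- **Disadvantages**:
  - Computationally expensive: typically every sentence must be embedded before boundaries can be chosen.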
- **Use Case**: When high accuracy and coherence are required.
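
The idea can be sketched without any ML dependency by comparing adjacent sentences and starting a new chunk when their similarity drops. The example below swaps in bag-of-words cosine similarity as a stand-in for real sentence embeddings, so it only detects shared vocabulary, not meaning. The function names and the `threshold` value are assumptions for illustration.

```python
import math
import re
from collections import Counter

def _vector(sentence: str) -> Counter:
    # Bag-of-words counts; a real system would use a sentence embedding here.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    chunks: list[list[str]] = []
    for sentence in sentences:
        # Compare with the last sentence of the current chunk.
        if chunks and _cosine(_vector(chunks[-1][-1]), _vector(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])  # similarity dropped: start a new chunk
    return chunks
```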

#### 7. By Sentences
- **Description**: Use sentence delimiters (e.g., `.`, `?`, or `!`) to split text.
- **Advantages**:
  - Simple and effective for preserving coherence.
- **Disadvantages**:
  - Short sentences might create too many small chunks.
  - Naive delimiter matching misfires on abbreviations and decimals (e.g., `Dr.` or `3.14`).
- **Use Case**: Conversational or narrative documents.
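
A regex-based sketch of sentence splitting. It splits after `.`, `!`, or `?` only when whitespace and then an uppercase letter, digit, or quote follows, which avoids breaking decimals such as `3.5` but still splits incorrectly after abbreviations like `Dr. Smith`. The names `SENTENCE_END` and `chunk_by_sentences` are hypothetical.

```python
import re

# Split at whitespace that follows sentence punctuation and precedes
# an uppercase letter, a digit, or an opening quote.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9\"'])")

def chunk_by_sentences(text: str) -> list[str]:
    return [s.strip() for s in SENTENCE_END.split(text.strip()) if s.strip()]
```

For production use, the sentence tokenizers in `NLTK` or `spaCy` handle abbreviations and other edge cases more reliably.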

#### 8. By Tables or Figures
- **Description**: Extract tables or figures separately as individual chunks.
- **Advantages**:
  - Maintains logical grouping for structured data.
- **Disadvantages**:
  - Requires robust parsing of PDF layouts.
- **Use Case**: Financial reports, scientific papers.
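
Robust table extraction needs a layout-aware parser, but the principle of separating tables from prose can be sketched on plain text. The heuristic below assumes layout-preserving extraction (for example `pdftotext -layout`), where table cells are separated by runs of two or more spaces; the rule "at least two such gaps means a table row" and both function names are assumptions for illustration.

```python
import re

# A column gap is a run of 2+ spaces between two non-space characters.
COLUMN_GAP = re.compile(r"(?<=\S) {2,}(?=\S)")

def is_table_row(line: str) -> bool:
    # Heuristic: at least two column gaps, i.e. three or more cells.
    return len(COLUMN_GAP.findall(line.strip())) >= 2

def split_tables(text: str) -> list[dict]:
    chunks: list[dict] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        kind = "table" if is_table_row(line) else "text"
        if chunks and chunks[-1]["type"] == kind:
            chunks[-1]["lines"].append(line.strip())  # extend the current run
        else:
            chunks.append({"type": kind, "lines": [line.strip()]})
    return chunks
```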

#### 9. Hybrid Chunking
- **Description**: Combine multiple strategies, e.g., by pages first, then split long pages by headings or token limits.
- **Advantages**:
  - Flexible and customizable for complex documents.
- **Disadvantages**:
  - Requires careful implementation so that the combined strategies neither duplicate text unintentionally nor leave gaps between chunks.
- **Use Case**: Mixed-content PDFs with text, tables, and images.
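
The "pages first, then token limits" combination described above might look like the following sketch. It assumes form-feed page separators in the extracted text and uses whitespace-separated words as a stand-in for real tokens; `hybrid_chunks` is a hypothetical name.

```python
def hybrid_chunks(text: str, max_tokens: int = 512) -> list[dict]:
    chunks = []
    # Stage 1: split by page (assumes a form feed between pages).
    for page_number, page_text in enumerate(text.split("\f"), start=1):
        tokens = page_text.split()  # stand-in for a real tokenizer
        # Stage 2: split pages that exceed the limit into numbered parts.
        for part, start in enumerate(range(0, len(tokens), max_tokens)):
            chunks.append({
                "page": page_number,
                "part": part,
                "text": " ".join(tokens[start:start + max_tokens]),
            })
    return chunks
```

Because the second stage slices each page's token list without overlap, every token lands in exactly one chunk, which addresses the gap and overlap concern for this particular combination.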

#### 10. Overlapping Windows
- **Description**: Create overlapping chunks of text (e.g., a sliding window approach with 20% overlap).
- **Advantages**:
  - Preserves context at chunk boundaries.
- **Disadvantages**:
  - Increases the number of chunks, stores the overlapping text more than once, and raises embedding and retrieval cost accordingly.
- **Use Case**: When context continuity is critical (e.g., narrative or dialog-heavy text).
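
A sketch of the sliding-window approach over a list of tokens. The parameter names `window` and `overlap` (a fraction of the window) and the function name are assumptions for illustration.

```python
def sliding_window_chunks(
    tokens: list[str], window: int = 200, overlap: float = 0.2
) -> list[list[str]]:
    if window <= 0 or not 0 <= overlap < 1:
        raise ValueError("window must be positive and overlap in [0, 1)")
    # Number of tokens shared by consecutive chunks, and the resulting stride.
    overlap_tokens = round(window * overlap)
    step = max(1, window - overlap_tokens)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

With a window of 5 tokens and 20% overlap, each chunk starts 4 tokens after the previous one, so consecutive chunks share one token.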


#### 11. By Semantic Similarity
- **Description**: Group sentences or paragraphs with similar topics or themes using clustering algorithms (e.g., K-Means).
- **Advantages**:
  - Ensures semantically related text is grouped.
- **Disadvantages**:
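  - Requires embedding generation and clustering, which can be resource-intensive.
  - Clusters may combine passages from distant parts of the document, so the original reading order is lost.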
- **Use Case**: FAQs, knowledge bases.
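
To show the grouping idea without external dependencies, the sketch below swaps K-Means over embeddings for a simpler technique: greedy single-pass clustering using Jaccard similarity on word sets. The stopword list, the `threshold` value, and the function names are all assumptions for illustration; a production system would cluster embedding vectors instead.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"i", "my", "a", "an", "the", "do", "can", "how", "where",
             "to", "and", "it", "for"}

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def _jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_by_similarity(passages: list[str], threshold: float = 0.15) -> list[list[str]]:
    clusters: list[dict] = []  # each cluster keeps its vocabulary and members
    for passage in passages:
        words = _words(passage)
        # Find the existing cluster whose vocabulary overlaps most.
        best = max(clusters, key=lambda c: _jaccard(words, c["vocab"]), default=None)
        if best is not None and _jaccard(words, best["vocab"]) >= threshold:
            best["items"].append(passage)
            best["vocab"] |= words
        else:
            clusters.append({"vocab": set(words), "items": [passage]})
    return [c["items"] for c in clusters]
```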

#### 12. By Visual Layout
- **Description**: Use the visual layout (e.g., columns, font size) to chunk content.
- **Advantages**:
  - Retains the document's intended structure.
- **Disadvantages**:
  - Requires layout-aware PDF parsing tools that expose positions and fonts (e.g., `PyMuPDF`, `pdfplumber`).
- **Use Case**: Magazines, newsletters.
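
As an illustration of layout-based chunking, the sketch below assigns words to columns by their horizontal position and then reads each column top to bottom. It assumes the input is a list of word dictionaries with `text`, `x0`, and `top` keys, a shape chosen to resemble what `pdfplumber`'s `extract_words()` returns, and it assumes equal-width columns. The function name and the `page_width` parameter are hypothetical.

```python
def chunk_by_columns(words: list[dict], page_width: float, n_columns: int = 2) -> list[str]:
    column_width = page_width / n_columns
    columns: list[list[dict]] = [[] for _ in range(n_columns)]
    for word in words:
        # Pick the column from the word's left edge; clamp to the last column.
        index = min(int(word["x0"] // column_width), n_columns - 1)
        columns[index].append(word)
    chunks = []
    for column in columns:
        # Reading order inside a column: top to bottom, then left to right.
        column.sort(key=lambda w: (w["top"], w["x0"]))
        if column:
            chunks.append(" ".join(w["text"] for w in column))
    return chunks
```

Real pages often need tolerance when comparing `top` values, since words on the same visual line can differ by a fraction of a point.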

#### 13. By Metadata Tags
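- **Description**: Use the structure tags embedded in the PDF (e.g., the heading, paragraph, and table tags of a Tagged PDF, or XML tags in structured exports) to delimit chunks.
- **Advantages**:
  - Leverages inherent structure.
- **Disadvantages**:
  - Many PDFs are untagged or only partially tagged, so useful metadata is often missing.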
- **Use Case**: Tagged PDFs or XML-based PDF exports.
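
For an XML-based export, tag-driven chunking can be sketched with the standard library. The XML layout below is hypothetical: it borrows the `Sect`, `H1`, and `P` names from Tagged PDF structure types, but a real export may use different element names and nesting.

```python
import xml.etree.ElementTree as ET

def chunk_by_sections(xml_text: str) -> list[dict]:
    root = ET.fromstring(xml_text)
    chunks = []
    # Treat every <Sect> element as one chunk.
    for section in root.iter("Sect"):
        heading = section.find("H1")
        title = "".join(heading.itertext()).strip() if heading is not None else None
        # Only direct <P> children belong to this section's text.
        paragraphs = ["".join(p.itertext()).strip() for p in section.findall("P")]
        chunks.append({"heading": title, "text": "\n".join(paragraphs)})
    return chunks
```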

#### 14. Hierarchical Chunking
- **Description**: First split by sections, then further split by paragraphs or token limits.
- **Advantages**:
  - Multi-level structure preserves context and granularity.
- **Disadvantages**:
  - More complex to implement and tune than a single-level strategy.
- **Use Case**: Books, technical documents.
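
A sketch of the three levels (sections, then paragraphs, then a token cap) in one function. It assumes numbered headings at the start of a line and blank-line paragraph breaks, uses whitespace-separated words as a stand-in for tokens, and ignores any text before the first heading. `HEADING_RE` and `hierarchical_chunks` are hypothetical names.

```python
import re

# Hypothetical pattern: numbered headings like "1. Intro" or "2.1 Methods".
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?[ \t]+[A-Z].*$", re.MULTILINE)

def hierarchical_chunks(text: str, max_tokens: int = 200) -> list[dict]:
    chunks = []
    matches = list(HEADING_RE.finditer(text))
    # Level 1: sections (text before the first heading is ignored here).
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        # Level 2: paragraphs within the section.
        for paragraph in re.split(r"\n\s*\n", body):
            tokens = paragraph.split()
            # Level 3: cap each paragraph at max_tokens.
            for start in range(0, len(tokens), max_tokens):
                chunks.append({
                    "section": match.group().strip(),
                    "text": " ".join(tokens[start:start + max_tokens]),
                })
    return chunks
```

Storing the section heading on every chunk is one way to preserve context after a long section has been split into smaller pieces.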

#### Tool Support for PDF Chunking
- **Libraries**:
  - `pypdf` (the maintained successor to `PyPDF2`), `PyMuPDF`, `pdfplumber` (for text extraction).
  - `spaCy`, `NLTK` (for NLP-based chunking, e.g., sentence segmentation).
  - `LangChain` (provides built-in text splitters for document chunking).
- **Pre-trained Models**:
  - Use BERT, Sentence-BERT, or GPT embeddings for semantic chunking.