[Bug]: Metadata prepending is lost for subsequent documents when multiple small files are merged into a single chunk #2204

@gona-sreelatha

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

When chunking with prepend_metadata: true, GraphRAG prepends metadata (e.g., title) only for the first document in a chunk. If multiple small documents are grouped into a single chunk, the metadata for every subsequent document is dropped.

This happens when individual document contents are smaller than the configured chunk size, causing GraphRAG to merge multiple documents into one chunk.

Current behavior:

  1. Multiple documents with small content are combined into a single chunk.
  2. Metadata (such as title) is prepended only once, corresponding to the first document.
  3. Content from subsequent documents appears in the same chunk without their associated metadata.
  4. As a result, chunk-to-document attribution becomes ambiguous or incorrect.
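
To make the symptom concrete, here is a minimal toy model of the behavior described above. It is not GraphRAG's code: `Doc`, `merge_then_prepend`, and the whitespace-based token count are all hypothetical stand-ins chosen only to show how packing first and prepending once loses the later titles.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str  # hypothetical metadata field, mirroring `metadata: [title]`
    text: str


def merge_then_prepend(docs, chunk_size):
    """Toy model: pack documents into chunks, then prepend metadata once per chunk.

    Token counts are approximated with whitespace splitting (an assumption;
    the real pipeline uses a tokenizer such as cl100k_base).
    """
    groups, current, used = [], [], 0
    for doc in docs:
        n = len(doc.text.split())
        # Start a new chunk only when the next document would overflow.
        if current and used + n > chunk_size:
            groups.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        groups.append(current)

    # Only the first document's title is attached to each merged chunk.
    return [
        f"title: {group[0].title}\n" + " ".join(d.text for d in group)
        for group in groups
    ]


docs = [
    Doc("a.txt", "alpha one"),
    Doc("b.txt", "beta two"),
    Doc("c.txt", "gamma three"),
]
chunks = merge_then_prepend(docs, chunk_size=100)
for chunk in chunks:
    print(chunk)
```

With these three small documents and a large chunk size, everything lands in one chunk that carries only the title of `a.txt`, so the text from `b.txt` and `c.txt` can no longer be attributed.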

Expected behavior:
One of the following (or an equivalent deterministic behavior):

  • Each document’s metadata should be prepended before its respective content, even if multiple documents share the same chunk.
    OR
  • Documents should not be merged into a single chunk when prepend_metadata: true, ensuring metadata consistency.
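
A rough sketch of the first option, assuming a simplified packer rather than GraphRAG's actual chunker: the hypothetical `pack_with_per_doc_metadata` builds a metadata-prefixed segment per document before packing, so each title travels with its own content. Whitespace splitting stands in for real tokenization, and documents larger than the chunk size are not split here.

```python
def pack_with_per_doc_metadata(docs, chunk_size):
    """Pack (title, text) pairs into chunks, keeping each title with its text.

    Hypothetical helper for illustration only. The metadata line counts
    toward the chunk budget, which is approximated by whitespace tokens.
    """
    chunks, current, used = [], [], 0
    for title, text in docs:
        segment = f"title: {title}\n{text}"
        n = len(segment.split())
        # Flush the current chunk if this segment would overflow it.
        if current and used + n > chunk_size:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(segment)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


small_docs = [
    ("a.txt", "alpha one"),
    ("b.txt", "beta two"),
    ("c.txt", "gamma three"),
]
per_doc_chunks = pack_with_per_doc_metadata(small_docs, chunk_size=8)
for chunk in per_doc_chunks:
    print(chunk)
    print("---")
```

The trade-off is that repeated metadata consumes part of every chunk's budget. The second option avoids that cost by never merging documents, at the price of producing many very small chunks.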

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

    requests_per_minute: 200            # auto = 1 rpm. set to null to disable rate limiting

    ### Input settings ###
    input:
      storage:
        type: file # or blob
        base_dir: PLACEHOLDER
      file_pattern: PLACEHOLDER
      metadata:
        - title

    chunks:
      size: PLACEHOLDER
      overlap: PLACEHOLDER
      encoding_model: cl100k_base
      group_by_columns: [id]
      prepend_metadata: true

Logs and screenshots

No response

Additional Information

  • GraphRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:

@natoverse

Labels: bug, triage
