Retrieval score differs between UI upload (.docx) and API upload (.txt) despite identical chunk content and embedding model #36799

@d5devgodai-blip

Self Checks

  • I have read the Contributing Guide and Language Policy.
  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report, otherwise it will be closed.
  • [Chinese & non-English users] Please submit in English, otherwise the issue will be closed :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

1.14.2

Cloud or Self Hosted

Cloud

Steps to reproduce

Environment

  • Dify version: 1.14.2 (Cloud)
  • Embedding model: text-embedding-3-large (same for both)
  • API endpoint: POST /v1/datasets/{dataset_id}/document/create-by-file

Steps to Reproduce

  1. Extract text from a Japanese PDF using PyMuPDF
  2. Fix spacing artifacts with GPT-4.1
  3. Method A (UI) — paste cleaned text into .docx, upload via Dify UI
  4. Method B (API) — upload same text as .txt via create-by-file with identical process_rule
data = {
    "name": "manual",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "hierarchical",
        "rules": {
            "parent_mode": "paragraph",
            "segmentation":          {"separator": "###", "max_tokens": 4000, "chunk_overlap": 0},
            "subchunk_segmentation": {"separator": "$$$", "max_tokens": 400,  "chunk_overlap": 50},
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": True},
                {"id": "remove_urls_emails",  "enabled": False}
            ]
        }
    }
}
  5. Query both datasets with the same question
  6. Compare retrieval scores
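
For anyone trying to reproduce Method B, the upload can be sketched with the standard library alone. This is a minimal sketch, not the exact script used for this report: the form field names `data` and `file`, the Bearer-token header, and the base URL are assumptions about the create-by-file endpoint, and `build_create_by_file_request` is a hypothetical helper. It only builds the request so the exact bytes being uploaded can be inspected before anything is sent.

```python
import json
import uuid


def build_create_by_file_request(base_url, dataset_id, api_key,
                                 process_data, filename, file_bytes,
                                 content_type="text/plain"):
    """Return (url, headers, body) for a multipart POST. Nothing is sent."""
    boundary = uuid.uuid4().hex
    url = f"{base_url}/v1/datasets/{dataset_id}/document/create-by-file"

    # Part 1: the settings dict, serialized as a JSON string
    # (field name "data" is an assumption).
    data_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="data"\r\n'
        "Content-Type: text/plain\r\n\r\n"
        f"{json.dumps(process_data, ensure_ascii=False)}\r\n"
    ).encode("utf-8")

    # Part 2: the raw file bytes (field name "file" is an assumption).
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8") + file_bytes + b"\r\n"

    closing = f"--{boundary}--\r\n".encode("utf-8")

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return url, headers, data_part + file_part + closing


# Placeholder values for illustration only.
url, headers, body = build_create_by_file_request(
    base_url="https://api.dify.ai",      # assumed Cloud base URL
    dataset_id="YOUR_DATASET_ID",
    api_key="YOUR_DATASET_API_KEY",
    process_data={"name": "manual", "indexing_technique": "high_quality"},
    filename="manual.txt",
    file_bytes="### 設定手順".encode("utf-8"),
)
```

To actually send it, the tuple can be passed to `urllib.request.Request(url, data=body, headers=headers, method="POST")`. Building the body by hand makes it explicit that the `.txt` file is uploaded as UTF-8 bytes, which is one of the places where the two paths could diverge.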

✔️ Expected Behavior

Same chunk content + same embedding model + same query = same retrieval score, regardless of upload method or file format.
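
The expectation can be made concrete with a small sketch. It assumes the retrieval score is a cosine similarity between the query embedding and the stored chunk embedding, which depends on the vector store behind the dataset and is not confirmed here. The vectors are toy values, not real `text-embedding-3-large` output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 0.0]          # toy query embedding
chunk_ui = [1.0, 0.0, 0.0]       # toy vector stored via the UI path
chunk_api_same = [1.0, 0.0, 0.0]  # same vector stored via the API path
chunk_api_diff = [1.0, 1.0, 0.0]  # a different stored vector

# Identical stored vectors must give identical scores for the same query.
same = cosine_similarity(query, chunk_ui) == cosine_similarity(query, chunk_api_same)
# A score gap therefore implies the stored vectors are not the same.
gap = cosine_similarity(query, chunk_ui) - cosine_similarity(query, chunk_api_diff)
```

Under that assumption, two datasets can only return different scores for the same query if the vectors they store differ, which in turn means the text sent to the embedding model differed somewhere.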

❌ Actual Behavior

Method             | Format | Score
Dify UI            | .docx  | 0.99
API create-by-file | .txt   | 0.69

Chunk content is confirmed identical in the Dify chunk viewer. Only the upload method differs.
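
Text that looks identical in a viewer can still differ at the code point level, which matters for Japanese text (full-width versus ASCII spaces, zero-width characters, composed versus decomposed kana). The sketch below is one way to check two chunk strings copied out of each dataset. It is a diagnostic aid only, not a confirmed cause, and `hidden_differences` and `same_after_nfkc` are hypothetical helper names.

```python
import difflib
import unicodedata


def codepoints(s):
    """Describe each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in s]


def hidden_differences(a, b):
    """Return the non-matching spans between two strings, as code points."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, codepoints(a[i1:i2]), codepoints(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def same_after_nfkc(a, b):
    """True if the strings match once compatibility forms are normalized."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# Toy example: a full-width (ideographic) space versus an ASCII space.
ui_chunk = "設定\u3000手順"
api_chunk = "設定 手順"

for tag, left, right in hidden_differences(ui_chunk, api_chunk):
    print(tag, left, right)
print("equal after NFKC:", same_after_nfkc(ui_chunk, api_chunk))
```

If this reports any difference for the real chunks, the two paths are embedding different strings even though the chunk viewer renders them the same.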

Suspected Cause

The embedding vector stored for each chunk appears to differ between the two paths, even though the chunk text shown in the viewer is identical.

Related Issues

#22270, #22118: hierarchical chunking broken via API
#35860: partial fix for automatic mode
