Self Checks
Dify version
1.14.2
Cloud or Self Hosted
Cloud
Steps to reproduce
Environment
- Dify version: 1.14.2 (Cloud)
- Embedding model: `text-embedding-3-large` (same for both datasets)
- API endpoint: `POST /v1/datasets/{dataset_id}/document/create-by-file`
Steps to Reproduce
- Extract text from a Japanese PDF using PyMuPDF
- Fix spacing artifacts with GPT-4.1
- Method A (UI): paste the cleaned text into a `.docx` file and upload it via the Dify UI
- Method B (API): upload the same text as a `.txt` file via `create-by-file` with an identical `process_rule`
data = {
    "name": "manual",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "hierarchical",
        "rules": {
            "parent_mode": "paragraph",
            "segmentation": {"separator": "###", "max_tokens": 4000, "chunk_overlap": 0},
            "subchunk_segmentation": {"separator": "$$$", "max_tokens": 400, "chunk_overlap": 50},
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": True},
                {"id": "remove_urls_emails", "enabled": False},
            ],
        },
    },
}
- Query both datasets with the same question
- Compare retrieval scores
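For reference, Method B can be sketched in Python roughly as follows. This is a minimal sketch, not the script actually used for this report: the base URL `https://api.dify.ai/v1`, the helper names `build_form_data` and `upload_txt`, and the use of the third-party `requests` library are all assumptions. The settings travel as a JSON string in the `data` form field, so Python's `True`/`False` are serialized as JSON `true`/`false`.

```python
import json
import os

# process_rule exactly as listed in the steps above.
PROCESS_RULE = {
    "mode": "hierarchical",
    "rules": {
        "parent_mode": "paragraph",
        "segmentation": {"separator": "###", "max_tokens": 4000, "chunk_overlap": 0},
        "subchunk_segmentation": {"separator": "$$$", "max_tokens": 400, "chunk_overlap": 50},
        "pre_processing_rules": [
            {"id": "remove_extra_spaces", "enabled": True},
            {"id": "remove_urls_emails", "enabled": False},
        ],
    },
}


def build_form_data(name):
    """Return the non-file multipart fields (hypothetical helper)."""
    settings = {
        "name": name,
        "indexing_technique": "high_quality",
        "doc_form": "hierarchical_model",
        "process_rule": PROCESS_RULE,
    }
    # The API receives the settings as one JSON string in the "data" field.
    return {"data": json.dumps(settings, ensure_ascii=False)}


def upload_txt(api_key, dataset_id, path, base_url="https://api.dify.ai/v1"):
    """Upload a .txt file via create-by-file (assumed base URL and auth scheme)."""
    import requests  # third-party; imported lazily so build_form_data stays stdlib-only

    url = f"{base_url}/datasets/{dataset_id}/document/create-by-file"
    with open(path, "rb") as fh:
        resp = requests.post(
            url,
            headers={"Authorization": f"Bearer {api_key}"},
            data=build_form_data("manual"),
            files={"file": (os.path.basename(path), fh, "text/plain")},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()
```

Keeping the payload construction separate from the network call makes it easy to log the exact JSON that was sent and compare it with the settings the UI shows for the Method A document.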
✔️ Expected Behavior
Same chunk content + same embedding model + same query = same retrieval score, regardless of upload method or file format.
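The expectation above could be checked programmatically along these lines. This is a sketch under assumptions: it presumes the dataset retrieval endpoint `POST /v1/datasets/{dataset_id}/retrieve` returns a JSON object with a `records` list whose items carry a numeric `score`, and the helper names `retrieve`, `top_score`, and `scores_match` are made up for illustration.

```python
import json
import urllib.request


def top_score(response):
    """Return the highest score in a retrieve response, or None if there are no records."""
    scores = [r["score"] for r in response.get("records", []) if r.get("score") is not None]
    return max(scores) if scores else None


def retrieve(api_key, dataset_id, query, base_url="https://api.dify.ai/v1"):
    """Query one dataset (assumed endpoint and response shape)."""
    req = urllib.request.Request(
        f"{base_url}/datasets/{dataset_id}/retrieve",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def scores_match(resp_a, resp_b, tol=1e-6):
    """True when both responses have a top score and the two agree within tol."""
    a, b = top_score(resp_a), top_score(resp_b)
    return a is not None and b is not None and abs(a - b) <= tol
```

Calling `retrieve` once per dataset with the same query and passing both results to `scores_match` turns the expected behavior into a single boolean that can be re-run after each re-upload.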
❌ Actual Behavior
| Method | Format | Score |
| --- | --- | --- |
| Dify UI | .docx | 0.99 |
| API create-by-file | .txt | 0.69 |
Chunk content is confirmed identical in the Dify chunk viewer. Only the upload method differs.
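Because a viewer can render different code points identically (for example an ideographic space and an ASCII space in Japanese text), a character-level comparison of the two chunk strings may help rule out invisible differences. The following is a minimal sketch; the helper names and the sample strings are hypothetical and are not taken from the actual datasets.

```python
import unicodedata


def first_difference(a, b):
    """Return (index, char_a, char_b) at the first mismatch, or None if the strings are equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    if len(a) != len(b):
        # One string is a prefix of the other; report where the shorter one ends.
        i = min(len(a), len(b))
        return i, a[i:i + 1], b[i:i + 1]
    return None


def describe(ch):
    """Render a character as its code point plus Unicode name."""
    if not ch:
        return "<end of text>"
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


# Hypothetical chunk texts that look alike but differ in one whitespace character.
ui_chunk = "保証期間は\u3000一年です"  # ideographic space
api_chunk = "保証期間は 一年です"      # ASCII space

diff = first_difference(ui_chunk, api_chunk)
if diff is not None:
    index, ca, cb = diff
    print(index, describe(ca), "vs", describe(cb))
```

If the chunk text fetched from both datasets compares equal at this level, a text difference at display time is ruled out and the discrepancy is more likely to sit in what was embedded at indexing time.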
Suspected Cause
The embedding vector stored for each chunk differs between the two paths. Likely causes:
- `remove_extra_spaces` pre-processing may behave differently for `.txt` vs `.docx` before the embedding call, so the text that is embedded differs from the text visible in the UI.
- The hierarchical process rule may not be applied on the API path (#22270, #22118, #35860), causing a fallback to general chunking with different embedding behavior.
Related Issues
- #22270, #22118: hierarchical chunking broken via API
- #35860: partial fix for automatic mode