fix: handle long text in embedding requests properly#90

Merged
yingapple merged 1 commit into mindverse:master from Airmomo:fix/long-text-embedding
Mar 28, 2025

Conversation

Contributor

@Airmomo Airmomo commented Mar 27, 2025

Issue

When processing long documents (>8000 characters), the embedding service truncates the content, resulting in loss of information and potentially inaccurate semantic search results. The issue occurs in document_service.py:

# Current implementation simply truncates content
content = document.raw_content[:8000] if len(document.raw_content) > 8000 else document.raw_content
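The scale of the loss is easy to demonstrate (a standalone illustration, not project code):

```python
# A document longer than the hard-coded 8000-character limit.
raw_content = "x" * 12000

# The truncating logic quoted above.
content = raw_content[:8000] if len(raw_content) > 8000 else raw_content

lost = len(raw_content) - len(content)
print(lost)  # 4000 characters never reach the embedding model
```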

Cause analysis

  1. The original code had two main issues:
# In document_service.py - content was truncated
content = document.raw_content[:8000]

# In llm.py - no handling for long texts
data = {"input": texts, "model": user_llm_config.embedding_model_name}
  2. The problems with this approach:
  • Information loss: Content beyond 8000 characters is completely ignored
  • Semantic incompleteness: Truncated text may lose critical context
  • Hard-coded limits: No flexibility to adjust text length limit
  • Potential API failures: Long texts could cause embedding API errors

Fix

  1. Added configuration for maximum text length:
# Added to .env
EMBEDDING_MAX_TEXT_LENGTH=8000
  2. Implemented text chunking in LLMClient:
class LLMClient:
    def __init__(self):
        self.embedding_max_text_length = int(os.getenv('EMBEDDING_MAX_TEXT_LENGTH', 8000))

    def get_embedding(self, texts):
        # Split long texts into chunks
        for text in texts:
            if len(text) > self.embedding_max_text_length:
                chunks = [text[i:i + self.embedding_max_text_length] 
                         for i in range(0, len(text), self.embedding_max_text_length)]
                # ... process chunks and average embeddings
  3. Removed content truncation from document_service.py
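The elided chunk processing can be sketched as follows. This is a minimal, self-contained illustration of the chunk-and-average approach described above, not the merged implementation; `embed_fn` is a stand-in for the actual embedding API call.

```python
import os

def chunk_text(text, max_length):
    """Split text into consecutive chunks of at most max_length characters."""
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]

def average_embeddings(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

def get_embedding(texts, embed_fn, max_length=None):
    """Embed each text, chunking and averaging when it exceeds max_length."""
    if max_length is None:
        max_length = int(os.getenv('EMBEDDING_MAX_TEXT_LENGTH', 8000))
    results = []
    for text in texts:
        if len(text) > max_length:
            chunks = chunk_text(text, max_length)
            results.append(average_embeddings([embed_fn(c) for c in chunks]))
        else:
            results.append(embed_fn(text))
    return results
```

Averaging keeps the result in the same vector space as a single-chunk embedding, so callers need no changes; texts at or under the limit still produce exactly one API call each.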

Improvements

  1. Complete Content Processing:
    • All text content is now processed; no information loss
    • Long texts are properly chunked and processed in parts
  2. Better Configuration:
    • Text length limit is now configurable via environment variable
    • Default behavior maintains backward compatibility
  3. More Robust:
    • Prevents API failures due to excessive text length
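Because the averaged chunk embedding lives in the same vector space, downstream semantic search is unchanged: a query vector is compared against the (possibly averaged) document vector as before, e.g. by cosine similarity. A minimal sketch, not project code:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```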

…hen requesting embeddings, which led to loss of information.

    This fix:
    - Add text chunking in LLMClient to handle long texts
    - Remove truncation in document_service.py
    - Add EMBEDDING_MAX_TEXT_LENGTH config to control chunk size
    - Average embeddings of chunks to maintain semantic representation

    The fix ensures that:
    1. No content is lost for long texts
    2. API requests don't fail due to text length
    3. Complete semantic information is preserved
Contributor

@yingapple yingapple left a comment


Thanks for the contribution.
Very useful fix.

@yingapple yingapple merged commit 4fdc082 into mindverse:master Mar 28, 2025
1 check passed
Heterohabilis pushed a commit to Heterohabilis/Second-Me that referenced this pull request May 29, 2025
…hen requesting embeddings, which led to loss of information. (mindverse#90)

This fix:
    - Add text chunking in LLMClient to handle long texts
    - Remove truncation in document_service.py
    - Add EMBEDDING_MAX_TEXT_LENGTH config to control chunk size
    - Average embeddings of chunks to maintain semantic representation

    The fix ensures that:
    1. No content is lost for long texts
    2. API requests don't fail due to text length
    3. Complete semantic information is preserved
EOMZON pushed a commit to EOMZON/Second-Me that referenced this pull request Feb 1, 2026