fix: handle long text in embedding requests properly#90

Merged
yingapple merged 1 commit into mindverse:master from Airmomo:fix/long-text-embedding
Mar 28, 2025

Conversation

Contributor

@Airmomo Airmomo commented Mar 27, 2025

Issue

When processing long documents (>8000 characters), the embedding service truncates the content, resulting in loss of information and potentially inaccurate semantic search results. The issue occurs in document_service.py:

# Current implementation simply truncates content
content = document.raw_content[:8000] if len(document.raw_content) > 8000 else document.raw_content
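The scale of the loss is easy to demonstrate (a standalone illustration, not project code):

```python
# A document longer than the hard-coded 8000-character limit.
raw_content = "x" * 12000

# The truncating logic quoted above.
content = raw_content[:8000] if len(raw_content) > 8000 else raw_content

lost = len(raw_content) - len(content)
print(lost)  # 4000 characters never reach the embedding model
```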

Cause analysis

  1. The original code had two main issues:
# In document_service.py - content was truncated
content = document.raw_content[:8000]

# In llm.py - no handling for long texts
data = {"input": texts, "model": user_llm_config.embedding_model_name}
  2. The problems with this approach:
  • Information loss: Content beyond 8000 characters is completely ignored
  • Semantic incompleteness: Truncated text may lose critical context
  • Hard-coded limits: No flexibility to adjust text length limit
  • Potential API failures: Long texts could cause embedding API errors

Fix

  1. Added configuration for maximum text length:
# Added to .env
EMBEDDING_MAX_TEXT_LENGTH=8000
  2. Implemented text chunking in LLMClient:
class LLMClient:
    def __init__(self):
        self.embedding_max_text_length = int(os.getenv('EMBEDDING_MAX_TEXT_LENGTH', 8000))

    def get_embedding(self, texts):
        # Split long texts into chunks
        for text in texts:
            if len(text) > self.embedding_max_text_length:
                chunks = [text[i:i + self.embedding_max_text_length] 
                         for i in range(0, len(text), self.embedding_max_text_length)]
                # ... process chunks and average embeddings
  3. Removed content truncation from document_service.py
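The elided chunk processing can be sketched as follows. This is a minimal, self-contained illustration of the chunk-and-average approach described above, not the merged implementation; `embed_fn` is a stand-in for the actual embedding API call.

```python
import os

def chunk_text(text, max_length):
    """Split text into consecutive chunks of at most max_length characters."""
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]

def average_embeddings(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

def get_embedding(texts, embed_fn, max_length=None):
    """Embed each text, chunking and averaging when it exceeds max_length."""
    if max_length is None:
        max_length = int(os.getenv('EMBEDDING_MAX_TEXT_LENGTH', 8000))
    results = []
    for text in texts:
        if len(text) > max_length:
            chunks = chunk_text(text, max_length)
            results.append(average_embeddings([embed_fn(c) for c in chunks]))
        else:
            results.append(embed_fn(text))
    return results
```

Averaging keeps the result in the same vector space as a single-chunk embedding, so callers need no changes; texts at or under the limit still produce exactly one API call each.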

Improvements

  1. Complete Content Processing:
    • All text content is now processed; no information loss
    • Long texts are properly chunked and processed in parts
  2. Better Configuration:
    • Text length limit is now configurable via environment variable
    • Default behavior maintains backward compatibility
  3. More Robust:
    • Prevents API failures due to excessive text length
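Because the averaged chunk embedding lives in the same vector space, downstream semantic search is unchanged: a query vector is compared against the (possibly averaged) document vector as before, e.g. by cosine similarity. A minimal sketch, not project code:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```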

…hen requesting embeddings, which led to loss of information.

    This fix:
    - Add text chunking in LLMClient to handle long texts
    - Remove truncation in document_service.py
    - Add EMBEDDING_MAX_TEXT_LENGTH config to control chunk size
    - Average embeddings of chunks to maintain semantic representation

    The fix ensures that:
    1. No content is lost for long texts
    2. API requests don't fail due to text length
    3. Complete semantic information is preserved
Contributor

@yingapple yingapple left a comment


Thanks for the contribution.
Very useful fix.

@yingapple yingapple merged commit 4fdc082 into mindverse:master Mar 28, 2025
1 check passed
Heterohabilis pushed a commit to Heterohabilis/Second-Me that referenced this pull request May 29, 2025
…hen requesting embeddings, which led to loss of information. (mindverse#90)

This fix:
    - Add text chunking in LLMClient to handle long texts
    - Remove truncation in document_service.py
    - Add EMBEDDING_MAX_TEXT_LENGTH config to control chunk size
    - Average embeddings of chunks to maintain semantic representation

    The fix ensures that:
    1. No content is lost for long texts
    2. API requests don't fail due to text length
    3. Complete semantic information is preserved
EOMZON pushed a commit to EOMZON/Second-Me that referenced this pull request Feb 1, 2026