feat: support server side auto embedding #159

Merged
Mini256 merged 22 commits into main from support-server-side-auto-embedding on Aug 8, 2025

Conversation

Mini256 (Member) commented Jul 28, 2025

After this PR, server-side auto embedding will be enabled by default for EmbeddingFunction. Embeddings will be computed automatically on the database side.

Currently, server-side auto embedding has a few limitations:

  • Function-level API key configuration is not supported; only a global API key configuration is supported.
  • Image input (multimodal) is not supported for embedding.

Usage Example

By default, EmbeddingFunction uses server-side automatic embedding:

from pytidb.embeddings import EmbeddingFunction

text_embed = EmbeddingFunction(
    "openai/text-embedding-3-small",
    # use_server=True (default)
)

However, TiDB Serverless currently does not support multimodal inputs, so for multimodal embedding models, client-side embedding should be used instead.

To indicate that the embedding model supports multimodal input, set multimodal=True:

image_embed = EmbeddingFunction(
    "jina_ai/jina-embeddings-v4",
    multimodal=True,
)
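
The rule described above (server-side by default, client-side for multimodal models) can be pictured with a small stand-alone sketch. This is not pytidb's implementation: `EmbeddingConfig`, `embedding_side`, and the placeholder model name are hypothetical names made up for illustration. Only the `use_server` and `multimodal` flags come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    """Hypothetical stand-in for the options passed to EmbeddingFunction."""
    model_name: str
    use_server: bool = True    # mirrors the documented default
    multimodal: bool = False   # set for models that accept image input


def embedding_side(config: EmbeddingConfig) -> str:
    """Return where embeddings would be computed under the stated rule.

    Assumption: multimodal models always fall back to the client, because
    the server side does not accept image input yet.
    """
    if config.multimodal:
        return "client"
    return "server" if config.use_server else "client"


text_cfg = EmbeddingConfig("openai/text-embedding-3-small")
image_cfg = EmbeddingConfig("some-provider/multimodal-model", multimodal=True)
print(embedding_side(text_cfg), embedding_side(image_cfg))
```

How the library resolves the two flags internally is not shown in this description, so treat the precedence above as an assumption.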

Example Integration with TableModel:

import os
from typing import Optional

from app.db import tidb_client
from pytidb.embeddings import EmbeddingFunction
from pytidb.schema import TableModel, Field

# Set API key globally (function-level keys are not supported server-side)
tidb_client.configure_embedding_provider("openai", os.getenv("OPENAI_API_KEY"))

# Define embedding function
text_embed = EmbeddingFunction("openai/text-embedding-3-small")

# Define table schema with auto embedding config.
class Chunk(TableModel):
    id: int = Field(primary_key=True)
    text: str = Field()
    # Vector column computed automatically from the "text" column
    text_vec: Optional[list[float]] = text_embed.VectorField(source_field="text")

# Create table
tbl = tidb_client.create_table(schema=Chunk, if_exists="overwrite")

# Insert data
tbl.insert(Chunk(id=1, text="foo"))

# Search
results = tbl.search("bar").limit(1).to_pydantic(with_score=True)
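
One detail in the example above: `os.getenv` returns `None` when the variable is unset, so a missing key would be passed to the provider configuration unnoticed. A minimal sketch of a fail-fast lookup follows. `require_env` is a hypothetical helper, not part of pytidb.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise if it is missing.

    Hypothetical helper: an unset or empty variable is treated as an error
    so that a missing API key is reported before any database call is made.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could then replace the direct lookup, for example `tidb_client.configure_embedding_provider("openai", require_env("OPENAI_API_KEY"))`.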

@Mini256 Mini256 marked this pull request as ready for review July 28, 2025 08:02
@Mini256 Mini256 marked this pull request as draft July 28, 2025 08:03

@Mini256 Mini256 marked this pull request as ready for review July 29, 2025 02:25

@Mini256 Mini256 changed the title from "Support server side auto embedding" to "feat: support server side auto embedding" Aug 7, 2025
…to_embedding for dynamic model configuration
Icemap (Member) left a comment

LGTM

breezewish (Member) commented Aug 7, 2025

Just curious: in this example, what happens if I don't keep a global text_embed instance at all?

For example, using it like this:

import os
from typing import Optional

from app.db import tidb_client
from pytidb.embeddings import EmbeddingFunction
from pytidb.schema import TableModel, Field

# Set API key globally
tidb_client.configure_embedding_provider("openai", os.getenv("OPENAI_API_KEY"))

# Define table schema with auto embedding config.
class Chunk(TableModel):
    id: int = Field(primary_key=True)
    text: str = Field()
    # Embedding function created inline instead of as a module-level instance
    text_vec: Optional[list[float]] = EmbeddingFunction("openai/text-embedding-3-small").VectorField(source_field="text")

# Create table
tbl = tidb_client.create_table(schema=Chunk, if_exists="overwrite")

# Insert data
tbl.insert(Chunk(id=1, text="foo"))

# Search
results = tbl.search("bar").limit(1).to_pydantic(with_score=True)

Will there be any downsides? It seems that this way the EmbeddingFunction would be more tightly bound to the schema.

Also, when a user wants two tables, do they need to define two embedding functions, or can they reuse one?
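
From a plain-Python standpoint, the two styles in the question differ only in who holds the reference. A class body runs once, when the class is defined, so an inline call creates exactly one instance, and a module-level instance can feed fields in several classes. The sketch below uses a stand-in class to show this. `FakeEmbeddingFunction` and `vector_field` are hypothetical and say nothing about whether pytidb keeps per-instance state, which this thread does not state.

```python
class FakeEmbeddingFunction:
    """Hypothetical stand-in that counts how many instances are created."""
    instances_created = 0

    def __init__(self, model_name: str) -> None:
        type(self).instances_created += 1
        self.model_name = model_name

    def vector_field(self, source_field: str) -> dict:
        # Return a plain description of the field instead of a real column.
        return {"model": self.model_name, "source_field": source_field}


# Inline style: the call runs once, when the class body is executed.
class InlineChunk:
    text_vec = FakeEmbeddingFunction("openai/text-embedding-3-small").vector_field("text")


# Shared style: one instance reused by two table-like classes.
shared_embed = FakeEmbeddingFunction("openai/text-embedding-3-small")


class Article:
    body_vec = shared_embed.vector_field("body")


class Comment:
    body_vec = shared_embed.vector_field("body")


print(FakeEmbeddingFunction.instances_created)
```

Under these assumptions, the inline form costs one extra instance per schema class and nothing more.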

@Mini256 Mini256 merged commit 252d00f into main Aug 8, 2025
3 checks passed
@Mini256 Mini256 deleted the support-server-side-auto-embedding branch August 8, 2025 03:37