feat: support server side auto embedding by Mini256 · Pull Request #159 · pingcap/pytidb

Mini256 · 2025-07-28T06:28:36Z

After this PR, server-side auto embedding will be enabled by default for EmbeddingFunction. Embeddings will be computed automatically on the database side.

Currently, server-side auto embedding has a few limitations:

Function-level API key configuration is not supported; only global API key configuration is supported.
Image (multimodal) input is not supported for embedding.

Usage Example

By default, EmbeddingFunction uses server-side automatic embedding:

from pytidb.embeddings import EmbeddingFunction

text_embed = EmbeddingFunction(
    "openai/text-embedding-3-small",
    # use_server=True (default)
)

However, TiDB Serverless currently does not support multimodal inputs, so for multimodal embedding models, client-side embedding should be used instead.

To indicate that the embedding model supports multimodal input, set multimodal=True:

image_embed = EmbeddingFunction(
    "jina_ai/jina-embedding-v4",
    multimodal=True
)
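The choice between server-side and client-side embedding described above can be pictured as a small decision rule. The sketch below is not pytidb's implementation: `resolve_use_server` is a hypothetical helper written only from the behaviour stated in this description (server-side by default, client-side for multimodal models). Raising an error on a conflicting request is this sketch's own choice, not documented library behaviour.

```python
from typing import Optional


def resolve_use_server(multimodal: bool = False,
                       use_server: Optional[bool] = None) -> bool:
    """Hypothetical helper: decide where embeddings are computed.

    Returns True for server-side (in-database) embedding and False for
    client-side embedding. Mirrors the PR description only; it is not
    taken from the pytidb source.
    """
    if multimodal:
        # Server-side embedding does not accept image input yet, so a
        # multimodal model has to be embedded on the client.
        if use_server:
            # Assumption: reject the conflicting request explicitly.
            raise ValueError(
                "server-side embedding does not support multimodal input"
            )
        return False
    # Server-side embedding is the default when nothing is specified.
    return True if use_server is None else use_server


print(resolve_use_server())                 # text model, default settings
print(resolve_use_server(multimodal=True))  # multimodal model
```

Under these assumptions a plain text model resolves to server-side embedding and a model declared with `multimodal=True` resolves to client-side embedding.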

Example Integration with TableModel:

import os
from typing import Optional

from app.db import tidb_client
from pytidb.embeddings import EmbeddingFunction
from pytidb.schema import TableModel, Field

# Set API key globally
tidb_client.configure_embedding_provider("openai", os.getenv("OPENAI_API_KEY"))

# Define embedding function
text_embed = EmbeddingFunction("openai/text-embedding-3-small")

# Define table schema with auto embedding config.
class Chunk(TableModel):
    id: int = Field(primary_key=True)
    text: str = Field()
    text_vec: Optional[list[float]] = text_embed.VectorField(source_field="text")

# Create table
tbl = tidb_client.create_table(schema=Chunk, if_exists="overwrite")

# Insert data
tbl.insert(Chunk(id=1, text="foo"))

# Search
results = tbl.search("bar").limit(1).to_pydantic(with_score=True)
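To make "computed automatically on the database side" more concrete, the sketch below builds the kind of DDL such a schema could translate to: a `VECTOR` column defined as a stored generated column over the source field. The commits in this PR mention `VECTOR(dimensions)`, a generated column, and backticks around `source_field`; the `EMBED_TEXT` function name, the exact column definition, and the `auto_embedding_ddl` helper are assumptions made for illustration and are not taken from the pytidb source.

```python
def auto_embedding_ddl(table: str, source_field: str,
                       model: str, dimensions: int) -> str:
    """Build illustrative DDL for a server-computed vector column.

    Assumption: the database exposes an EMBED_TEXT(model, column)
    function that can be used in a generated column expression.
    """
    vec_field = f"{source_field}_vec"
    return (
        f"CREATE TABLE `{table}` (\n"
        f"  `id` INT PRIMARY KEY,\n"
        f"  `{source_field}` TEXT,\n"
        f"  `{vec_field}` VECTOR({dimensions})\n"
        f"    GENERATED ALWAYS AS "
        f"(EMBED_TEXT('{model}', `{source_field}`)) STORED\n"
        f")"
    )


# text-embedding-3-small produces 1536-dimensional vectors.
ddl = auto_embedding_ddl(
    "chunks", "text", "openai/text-embedding-3-small", 1536
)
print(ddl)
```

With a column defined this way, an insert only needs to supply `text`; the vector is derived by the database rather than by the client, which is why the API key has to be configured globally rather than per function.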



cursor · 2025-07-29T02:25:58Z

🚨 Bugbot Trial Expired

Your Bugbot trial has expired. Please purchase a license in the Cursor dashboard to continue using Bugbot.

cursor · 2025-07-29T14:25:37Z

Bugbot found 3 bugs

To see them, activate your membership in your Cursor dashboard.


Icemap

LGTM

breezewish · 2025-08-07T16:30:37Z

Just curious, for this example, what will happen if I don't keep the text_embed global instance at all?

For example, by using like this:

import os
from typing import Optional

from app.db import tidb_client
from pytidb.embeddings import EmbeddingFunction
from pytidb.schema import TableModel, Field

# Set API key globally
tidb_client.configure_embedding_provider("openai", os.getenv("OPENAI_API_KEY"))

# Define table schema with auto embedding config.
class Chunk(TableModel):
    id: int = Field(primary_key=True)
    text: str = Field()
    text_vec: Optional[list[float]] = EmbeddingFunction("openai/text-embedding-3-small").VectorField(source_field="text")

# Create table
tbl = tidb_client.create_table(schema=Chunk, if_exists="overwrite")

# Insert data
tbl.insert(Chunk(id=1, text="foo"))

# Search
results = tbl.search("bar").limit(1).to_pydantic(with_score=True)

Will there be any downsides? It seems that this way the EmbeddingFunction would be more tightly bound to the schema.

Also, when a user wants two tables, do they need to define two embedding functions, or can they reuse one?


Mini256 added 3 commits July 26, 2025 18:22

feat: add tidb dialect

3d19251

add generate column

4f4d3bb

Merge remote-tracking branch 'origin/main' into add-tidb-dialect

7e6c711

Mini256 marked this pull request as ready for review July 28, 2025 08:02

Mini256 marked this pull request as draft July 28, 2025 08:03

cursor bot reviewed Jul 28, 2025

pytidb/orm/distance_metric.py (resolved)

pytidb/search.py (outdated, resolved)

tests/conftest.py (outdated, resolved)

pytidb/schema.py (outdated, resolved)

Mini256 added 7 commits July 28, 2025 16:07

Merge branch 'main' into support-server-side-auto-embedding

1a7a38e

refactor: update VectorField to use VECTOR type and fix query string …

5ac8fb1

…formatting - Changed `VectorType(dimensions)` to `VECTOR(dimensions)` for consistency. - Updated the query string formatting to use backticks around `source_field` in the `Computed` function.

refine query_vector logic

7760649

refactor: remove TiDB dialect and related components

b63c015

revert: restore examples to use Text instead of TEXT

e1e3cfe

fix lint

f0f9028

feat: enhance auto embedding logic

19b3521

Mini256 marked this pull request as ready for review July 29, 2025 02:25

Mini256 added 2 commits July 29, 2025 16:34

refactor: use use_server instead of embed_in_sql

2ea33ed

refine auto embedding testcase

ede569c

Mini256 and others added 8 commits July 30, 2025 11:55

tests: refine image auto embedding test cases

873a093

revert

07feee6

feat: support configure server_embed_params on EmbeddingFunction and …

09559b3

…VectorField

chore: increase test timeout from 15 to 30 minutes

b5aeebd

Merge branch 'main' into support-server-side-auto-embedding

b9b36c8

refactor: replace hardcoded model dimensions with dynamic retrieval u…

08e4179

…sing get_model_dimensions

Merge branch 'main' into support-server-side-auto-embedding

3d4d14d

feat: add embedding provider configuration

d6f5830

Mini256 changed the title ~~Support server side auto embedding~~ feat: support server side auto embedding Aug 7, 2025

Mini256 requested review from Icemap, breezewish and sykp241095 August 7, 2025 06:11

fix: update model names in KNOWN_MODEL_DIMENSIONS and enhance test_au…

8f86785

…to_embedding for dynamic model configuration

Icemap approved these changes Aug 7, 2025


fix: handle None case for server_embed_params and update test_auto_em…

3c6f9de

…bedding for jina_ai model configuration

Mini256 merged commit 252d00f into main Aug 8, 2025
3 checks passed

Mini256 deleted the support-server-side-auto-embedding branch August 8, 2025 03:37

This was referenced Aug 11, 2025

Provide default dimension mappings for common models in EmbeddingFunction #144

Closed

table.search(search_type="vector") support l1 distance and negative_inner_product metric #130

Closed

Server-Side Auto Embedding #172

Closed

Mini256 merged 22 commits into main from support-server-side-auto-embedding


4 participants
