feat: tests and example of custom embedding function#167

Merged
Mini256 merged 13 commits into main from feat-custom-embed-func
Aug 8, 2025

Conversation

Member

@Icemap Icemap commented Aug 7, 2025

No description provided.

- **First Run**: Model download and loading may take a few minutes
- **GPU Acceleration**: BGE-M3 will automatically use GPU if available
- **Memory Usage**: BGE-M3 requires ~2GB GPU memory or ~4GB RAM
- **Batch Size**: Larger batches improve throughput but require more memory
Member

I couldn’t find a parameter that controls the batch size in the current code.

Member Author

It means you can pass several items (say, 5) to the get_source_embeddings() function in a single call. That gives better throughput at the cost of higher memory usage.
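To illustrate the trade-off described above, here is a minimal sketch of caller-side batching. It assumes only that the embedding function exposes a method taking a list of texts, as get_source_embeddings() does; `chunked`, `embed_in_batches`, and `fake_get_source_embeddings` are hypothetical helpers written for this example and are not part of pytidb.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    embed_many: Callable[[Sequence[str]], List[List[float]]],
    texts: Sequence[str],
    batch_size: int = 5,
) -> List[List[float]]:
    """Call embed_many once per batch and concatenate the results.

    Larger batches mean fewer calls (better throughput) but more
    inputs held in memory per call.
    """
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_many(batch))
    return vectors


def fake_get_source_embeddings(texts: Sequence[str]) -> List[List[float]]:
    """Stand-in for a real model: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in texts]


if __name__ == "__main__":
    docs = [f"doc {i}" for i in range(12)]
    vectors = embed_in_batches(fake_get_source_embeddings, docs, batch_size=5)
    print(len(vectors))
```

With a real model, `embed_many` would be something like `embed_func.get_source_embeddings`, and the batch size would be tuned against available GPU memory or RAM.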

for automatic embedding generation and vector search capabilities.
```

## Understanding the Code
Member

This section can be removed; pytidb's documentation is the better place to explain, step by step, how to define a custom embedding function.

Member Author

Should we? If a user lands here directly, they will at least get an overview of which methods they need to override to write an embedding function class.
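As a rough picture of the overview being discussed, the sketch below shows the general shape of such a class. Only get_source_embeddings() is named in this thread; the base class `EmbeddingFunctionSketch`, the other two method names, and the toy encoder are assumptions made for illustration and should be checked against pytidb's actual base class.

```python
from abc import ABC, abstractmethod
from typing import List


class EmbeddingFunctionSketch(ABC):
    """Local stand-in for a base class; NOT pytidb's real class."""

    @abstractmethod
    def get_query_embedding(self, query: str) -> List[float]:
        """Embed a search query (assumed method name)."""

    @abstractmethod
    def get_source_embedding(self, source: str) -> List[float]:
        """Embed one stored document (assumed method name)."""

    @abstractmethod
    def get_source_embeddings(self, sources: List[str]) -> List[List[float]]:
        """Embed several stored documents in one call."""


class ToyEmbeddingFunction(EmbeddingFunctionSketch):
    """Toy implementation; a real one would call a model such as BGE-M3."""

    dimensions = 4  # hypothetical; BGE-M3 would use 1024

    def _encode(self, text: str) -> List[float]:
        # Deterministic placeholder: sum character codes into 4 buckets.
        vec = [0.0] * self.dimensions
        for i, ch in enumerate(text):
            vec[i % self.dimensions] += ord(ch)
        return vec

    def get_query_embedding(self, query: str) -> List[float]:
        return self._encode(query)

    def get_source_embedding(self, source: str) -> List[float]:
        return self._encode(source)

    def get_source_embeddings(self, sources: List[str]) -> List[List[float]]:
        return [self._encode(source) for source in sources]
```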

Member

@Mini256 Mini256 left a comment

I encountered an error:

Traceback (most recent call last):
  File "/Users/xxxx/Projects/pytidb/examples/custom_embedding_function/main.py", line 25, in <module>
    embed_func = BGEM3EmbeddingFunction()
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxxx/Projects/pytidb/examples/custom_embedding_function/custom_embedding.py", line 46, in __init__
    self._init_model()
  File "/Users/xxxx/Projects/pytidb/examples/custom_embedding_function/custom_embedding.py", line 68, in _init_model
    actual_dims = test_output["dense_vecs"].shape[1]
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: tuple index out of range

Member Author

Icemap commented Aug 7, 2025

actual_dims = test_output["dense_vecs"].shape[1]

Removed the dimension check, since BGE-M3 has only one fixed dimension, which is 1024.
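The thread does not state why `.shape[1]` failed. One reading, offered here as an assumption, is that the test output was a one-dimensional array (a single vector rather than a batch), so its shape tuple had only one entry. The sketch below uses plain tuples in place of array shapes; `embedding_dim` is a hypothetical helper showing how reading the last axis works for both cases.

```python
from typing import Tuple


def embedding_dim(shape: Tuple[int, ...]) -> int:
    """Return the size of the last axis of an array shape.

    Works for a single vector, e.g. (1024,), and for a batch,
    e.g. (3, 1024), whereas shape[1] only works for the batch.
    """
    if not shape:
        raise ValueError("a scalar has no embedding dimension")
    return shape[-1]


single_shape = (1024,)    # assumed shape when one text is encoded
batch_shape = (3, 1024)   # assumed shape when a list of 3 texts is encoded

if __name__ == "__main__":
    print(embedding_dim(single_shape), embedding_dim(batch_shape))
    try:
        single_shape[1]
    except IndexError as exc:
        # Same error message as in the traceback above.
        print(exc)
```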

Member Author

Icemap commented Aug 7, 2025

Another thing: it seems we have run out of Jina AI tokens:

ERROR tests/test_auto_embedding_image.py::test_image_search_with_query_text - litellm.exceptions.APIConnectionError: litellm.APIConnectionError: Jina_aiException - litellm.Timeout: Connection timed out after None seconds.
ERROR tests/test_auto_embedding_image.py::test_image_search_with_image_path - litellm.exceptions.APIConnectionError: litellm.APIConnectionError: Jina_aiException - litellm.Timeout: Connection timed out after None seconds.
ERROR tests/test_auto_embedding_image.py::test_image_search_with_pil_image - litellm.exceptions.APIConnectionError: litellm.APIConnectionError: Jina_aiException - litellm.Timeout: Connection timed out after None seconds.

Icemap and others added 5 commits August 8, 2025 12:38
Co-authored-by: Mini256 <minianter@foxmail.com>
Co-authored-by: Mini256 <minianter@foxmail.com>
Co-authored-by: Mini256 <minianter@foxmail.com>
import dotenv
from custom_embedding import BGEM3EmbeddingFunction
from pytidb.schema import TableModel, Field
from pytidb.datatype import Text
Member

@Mini256 Mini256 Aug 8, 2025

Please ensure you have upgraded to pytidb==0.0.11, because in the new version, Text has been replaced with TEXT.
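One way to confirm the installed version before switching to TEXT is a small standard-library check. This is a sketch: `parse_version`, `installed_version`, and `is_at_least` are hypothetical helpers, and the parser handles only plain numeric versions such as 0.0.11.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a purely numeric dotted version such as '0.0.11'."""
    return tuple(int(part) for part in version.split("."))


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_at_least(installed: str, required: str) -> bool:
    """Compare versions numerically, so that 0.0.9 < 0.0.11."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    found = installed_version("pytidb")
    if found is None:
        print("pytidb is not installed")
    else:
        print(found, is_at_least(found, "0.0.11"))
```

Comparing tuples of integers avoids the string-comparison trap where "0.0.9" sorts after "0.0.11".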

Member Author

Let me delete the venv and reinstall everything.

Member Author

Changed:

  1. Text to TEXT
  2. create_table(..., mode='overwrite') to create_table(..., mode='overwrite')

And ran main.py successfully.

Member

@Mini256 Mini256 left a comment

LGTM

@Mini256 Mini256 merged commit 828a34a into main Aug 8, 2025
3 checks passed
@Mini256 Mini256 deleted the feat-custom-embed-func branch August 8, 2025 15:33