Refine index configuration validation and typed helpers by hnwyllmm · Pull Request #173 · oceanbase/pyseekdb

hnwyllmm · 2026-02-26T08:57:46Z

Summary

add typed HNSW/fulltext configuration helpers and export them from package entry points
enforce stricter parameter validation and normalization for dense/sparse/fulltext index configs while preserving forward-compatible pass-through for unknown analyzers/properties
update vector index SQL generation to honor configured type/lib and extend unit tests for new defaults and validation behavior
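
A minimal sketch of what a typed HNSW helper with stricter validation and normalization could look like. This is not the code from the PR: `HnswSketch`, `IndexType`, the enum members, the allowed distance names and the `vsag` default are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class IndexType(str, Enum):
    # Hypothetical members; the real enum in the PR may differ.
    HNSW = "hnsw"
    HNSW_SQ = "hnsw_sq"
    HNSW_BQ = "hnsw_bq"


# Assumed set of distance names, for illustration only.
ALLOWED_DISTANCES = {"l2", "cosine", "inner_product"}


@dataclass
class HnswSketch:
    """Illustrative stand-in for a typed HNSW configuration helper."""

    dimension: int
    distance: str = "l2"
    type: IndexType = IndexType.HNSW
    lib: str = "vsag"  # assumed default

    def __post_init__(self) -> None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.dimension, bool) or not isinstance(self.dimension, int):
            raise TypeError("dimension must be an int")
        if self.dimension <= 0:
            raise ValueError("dimension must be positive")
        # Normalize case and surrounding whitespace before validating.
        self.distance = self.distance.strip().lower()
        if self.distance not in ALLOWED_DISTANCES:
            raise ValueError(f"unsupported distance: {self.distance!r}")
        # Accept either an enum member or its raw string value.
        self.type = IndexType(self.type)
```

With this shape, `HnswSketch(dimension=8, distance=" Cosine ", type="hnsw_sq")` is normalized on construction, while `HnswSketch(dimension=0)` fails early instead of producing invalid SQL later.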

Test plan

PYTHONPATH=src pytest tests/unit_tests/test_configuration.py tests/unit_tests/test_sparse_vector_index_config.py
Run full unit/integration suite in CI

Made with Cursor

Summary by CodeRabbit

New Features
- Added specialized fulltext configs (IK, SPACE, BENG, NGRAM, NGRAM2) and BQ/SQ variants for HNSW vector indexes.
Improvements
- Stronger validation, normalization and safer defaults for index and sparse-vector configs (updated prune/refine/drop/refine_k defaults).
- HNSW config now exposes configurable type/lib and uses those values in index generation.
Breaking Changes
- Top-level export for the embedded client removed — import the client from the client module instead.
Chores
- Excluded a specific numpy release in dependency constraints.
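
The combination of stronger validation with forward-compatible pass-through for unknown analyzers can be sketched as follows. The analyzer names come from the summary above; the function name `normalize_analyzer`, the lower-casing rule and the warning text are assumptions, not the library's behavior.

```python
import warnings

# Names taken from the PR summary; lower-case spelling is an assumption.
KNOWN_ANALYZERS = {"ik", "space", "beng", "ngram", "ngram2"}


def normalize_analyzer(name: str) -> str:
    """Normalize known analyzer names; pass unknown ones through with a warning."""
    candidate = name.strip().lower()
    if candidate in KNOWN_ANALYZERS:
        return candidate
    # Unknown analyzers are not rejected, so newer server-side analyzers keep working.
    warnings.warn(
        f"unknown analyzer {name!r}; passing it through unchanged",
        UserWarning,
        stacklevel=2,
    )
    return name
```

Warning instead of raising keeps the client usable against a server that supports analyzers the client does not yet know about.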

Improve developer ergonomics with dedicated HNSW/fulltext config helpers, enforce documented parameter validation with clear warnings, and align SQL generation/exports/tests with the expanded configuration model. Made-with: Cursor

coderabbitai · 2026-02-26T08:58:05Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Expands and tightens configuration dataclasses/enums and validation, changes SparseVector defaults, makes VECTOR INDEX SQL use HNSW config fields, lazily exposes SeekdbEmbeddedClient from client package, updates tests and numpy constraint.

Changes

Cohort / File(s)	Summary
Top-level exports `src/pyseekdb/__init__.py`	Removed `SeekdbEmbeddedClient` from direct export; added re-exports for new config dataclasses and enums (IK*, SPACE, BENG, NGRAM, NGRAM2, Bq/Sq HNSW configs, etc.).
Client package surface `src/pyseekdb/client/__init__.py`	Added new public config/type exports; added TYPE_CHECKING import and lazy `SeekdbEmbeddedClient` exposure via `__getattr__` to defer import-time errors.
Client SQL generation `src/pyseekdb/client/client_base.py`	VECTOR INDEX SQL now derives TYPE and LIB from `hnsw_config.type` / `hnsw_config.lib` instead of hard-coded values.
Configuration core `src/pyseekdb/client/configuration.py`	Major refactor: introduced PrimitiveValue alias, normalization/validation helpers, new enums (HNSWIndexType, HNSWIndexLib, FulltextAnalyzer, IKMode, BQRefineType), typed FulltextIndexConfig variants (IK/SPACE/BENG/NGRAM/NGRAM2), HNSW variants (BqHNSWConfiguration, SqHNSWConfiguration), stronger SparseVectorIndexConfig defaults/validation, and a deprecated Configuration wrapper.
Tests — configuration `tests/unit_tests/test_configuration.py`	Updated tests for new config classes, HNSW `type`/`lib` handling, analyzer property validation, and adjusted expected SQL output.
Tests — sparse vector `tests/unit_tests/test_sparse_vector_index_config.py`, `tests/unit_tests/test_sparse_vector_sql.py`	Adjusted defaults/assertions for `SparseVectorIndexConfig` (prune/refine/drop ratios/refine_k) and relaxed one SQL assertion.
Integration test update `tests/integration_tests/test_admin_database_management.py`	Replaced module-qualified `pyseekdb.SeekdbEmbeddedClient` type check with direct import from `pyseekdb.client.client_seekdb_embedded`.
Project config `pyproject.toml`	Updated numpy constraint to exclude `2.4.0` (`>=1.26,!=2.4.0`).
CI workflow `.github/workflows/ci.yml`	Added environment variables for tests and a Python 3.14 sanity-check step for quick runtime checks.
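
As a rough illustration of the `client_base.py` change in the table above, the sketch below builds an index clause whose TYPE and LIB come from the configuration object rather than from literals. `HnswFields`, `vector_index_clause`, the index name and the exact clause syntax are hypothetical and are not copied from the PR.

```python
from dataclasses import dataclass


@dataclass
class HnswFields:
    """Hypothetical subset of the HNSW config fields used for SQL generation."""

    distance: str = "l2"
    type: str = "hnsw"
    lib: str = "vsag"


def vector_index_clause(column: str, cfg: HnswFields) -> str:
    # TYPE and LIB are read from the config instead of being hard-coded.
    return (
        f"VECTOR INDEX idx_vec ({column}) "
        f"WITH (distance={cfg.distance}, type={cfg.type}, lib={cfg.lib})"
    )
```

Calling `vector_index_clause("embedding", HnswFields(type="hnsw_sq"))` yields a clause containing `type=hnsw_sq`, which a hard-coded template could not express.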

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

support more options of hnsw index config #166 — overlaps HNSW configuration validation and VECTOR INDEX SQL generation (type/lib/properties).
chore: intro ruff check and make it happy #125 — modifies public API exports and client surface (related exported symbols and client/init changes).
refactor: test suite and fix test infrastructure #91 — changes tests referencing SeekdbEmbeddedClient and embedded-client import/isinstance usage.

Poem

🐰 I hopped through enums and tidy keys,

I stitched HNSW types with gentle ease,
IK, Ngram, Space now line the trail,
A lazy client peeks when imports fail,
Configs bloom bright — a rabbit's happy squeal.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 32.69% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Refine index configuration validation and typed helpers' accurately summarizes the main change: adding validation, typed configuration helpers, and exporting them from package entry points.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

coderabbitai

🧹 Nitpick comments (1)

tests/unit_tests/test_configuration.py (1)

226-255: Consider adding deprecation warning assertion.

The Configuration class emits a DeprecationWarning when instantiated, but none of these tests assert that the warning is raised. Consider wrapping the Configuration() calls with pytest.warns(DeprecationWarning) to document and verify the expected deprecation behavior.

Example for one test case

     def test_configuration_empty(self):
         """Test Configuration with no parameters"""
-        config = Configuration()
+        with pytest.warns(DeprecationWarning, match="Configuration is deprecated"):
+            config = Configuration()
         assert config.hnsw is None
         assert config.fulltext_config is None

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/test_configuration.py` around lines 226 - 255, Each test
should assert the deprecation warning emitted by instantiating Configuration:
wrap each instantiation of Configuration in the tests
(test_configuration_with_hnsw_only, test_configuration_with_fulltext_only,
test_configuration_with_both, test_configuration_empty) with
pytest.warns(DeprecationWarning) so the tests verify the DeprecationWarning
produced by the Configuration class upon construction.
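
The check requested here can also be expressed with only the standard library, which may help when reasoning about it outside pytest. In the sketch below, `LegacyConfiguration` is a hypothetical stand-in for the deprecated wrapper, and `construct_and_capture` is an illustrative helper; in the actual suite, `pytest.warns(DeprecationWarning)` plays the same role.

```python
import warnings


class LegacyConfiguration:
    """Hypothetical stand-in for the deprecated wrapper class."""

    def __init__(self, hnsw=None, fulltext_config=None):
        warnings.warn("Configuration is deprecated", DeprecationWarning, stacklevel=2)
        self.hnsw = hnsw
        self.fulltext_config = fulltext_config


def construct_and_capture():
    """Build the object and return it together with the warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # DeprecationWarning is often filtered by default, so force it on.
        warnings.simplefilter("always")
        config = LegacyConfiguration()
    return config, caught
```

A test can then assert both the object state and that at least one captured warning has the `DeprecationWarning` category.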

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc1f203 and 2f784de.

📒 Files selected for processing (6)

src/pyseekdb/__init__.py
src/pyseekdb/client/__init__.py
src/pyseekdb/client/client_base.py
src/pyseekdb/client/configuration.py
tests/unit_tests/test_configuration.py
tests/unit_tests/test_sparse_vector_index_config.py

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/pyseekdb/__init__.py (1)
112-147: ⚠️ Potential issue | 🟠 Major

`SeekdbEmbeddedClient` is exported in `__all__` but not actually importable from the top-level module.

`__all__` includes `"SeekdbEmbeddedClient"` (line 135), but this module neither imports it from `.client` nor defines `__getattr__` to resolve it lazily. This breaks the import path `from pyseekdb import SeekdbEmbeddedClient`, which the integration tests use.

The symbol is available in the `.client` submodule via its `__getattr__`, but the top-level module needs its own `__getattr__` to delegate to it.

Proposed fix

 from .client.collection import Collection
 
 try:
     __version__ = importlib.metadata.version("pyseekdb")
 except importlib.metadata.PackageNotFoundError:
     __version__ = "0.0.1.dev1"
 
 __author__ = "OceanBase <open_oceanbase@oceanbase.com>"
+
+
+def __getattr__(name: str):
+    if name == "SeekdbEmbeddedClient":
+        from .client import SeekdbEmbeddedClient
+        return SeekdbEmbeddedClient
+    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/pyseekdb/__init__.py` around lines 112 - 147, The top-level __all__
exports "SeekdbEmbeddedClient" but the module doesn't expose it, causing imports
like from pyseekdb import SeekdbEmbeddedClient to fail; add a top-level
__getattr__ that delegates missing attribute lookups to the .client submodule
(or explicitly import SeekdbEmbeddedClient from .client into this module) so the
exported symbol is resolvable; reference the existing __all__ list, the symbol
SeekdbEmbeddedClient, and implement __getattr__(name) to import .client and
return getattr(client, name) (or perform a one-time import of
SeekdbEmbeddedClient) to restore the expected top-level import behavior.
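
The delegation described above relies on module-level `__getattr__` (PEP 562). The following self-contained sketch demonstrates the mechanism with two throwaway in-memory modules; `demo_top`, `demo_client` and `EmbeddedClient` are hypothetical names and nothing here imports pyseekdb.

```python
import sys
import types

# Two in-memory modules so the example runs without any package on disk.
client = types.ModuleType("demo_client")
client.EmbeddedClient = type("EmbeddedClient", (), {})  # placeholder class

top = types.ModuleType("demo_top")
top.__all__ = ["EmbeddedClient"]


def _lazy_getattr(name: str):
    # Called only when normal attribute lookup on the module fails (PEP 562).
    if name in top.__all__:
        return getattr(client, name)
    raise AttributeError(f"module {top.__name__!r} has no attribute {name!r}")


# A function stored as `__getattr__` in the module namespace acts as the hook.
top.__getattr__ = _lazy_getattr
sys.modules["demo_top"] = top
sys.modules["demo_client"] = client

from demo_top import EmbeddedClient  # resolved through _lazy_getattr
```

Because the hook runs only on a failed lookup, the heavy import (here just a `getattr`) is deferred until the symbol is first requested, and names outside `__all__` still raise `AttributeError`.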

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unit_tests/test_sparse_vector_sql.py`:
- Around line 55-58: The test test_no_optional_params_by_default currently
discards the result of _get_sparse_vector_index_sql(config); update it to assert
meaningful behavior by capturing the returned SQL and asserting either that it
equals the expected default SQL string or at minimum is non-empty and contains
expected tokens (e.g., "CREATE INDEX", the table/column names produced by
_make_config). Locate test_no_optional_params_by_default, call sql =
_get_sparse_vector_index_sql(config), and add assertions (e.g., assert sql
and/or assert "CREATE INDEX" in sql and/or assert the exact expected SQL) so
regressions in default SQL rendering are caught.
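
A sketch of the kind of assertion being asked for, under stated assumptions: `render_sparse_index_sql` is a hypothetical stand-in for the helper under test, and the SQL shape it produces is invented for the example. Only the pattern matters, namely capturing the return value and asserting on it.

```python
from typing import Optional


def render_sparse_index_sql(column: str, params: Optional[dict] = None) -> str:
    """Hypothetical renderer standing in for the helper under test."""
    sql = f"VECTOR INDEX idx_sparse ({column})"
    if params:
        rendered = ", ".join(f"{key}={value}" for key, value in sorted(params.items()))
        sql += f" WITH ({rendered})"
    return sql


def test_default_sql_is_rendered() -> None:
    # Capture the result instead of discarding it, then assert on its content.
    sql = render_sparse_index_sql("sparse_embedding")
    assert sql, "renderer returned an empty string"
    assert sql.startswith("VECTOR INDEX")
    assert "sparse_embedding" in sql
```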

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8f872d and 5d9c6da.

📒 Files selected for processing (3)

src/pyseekdb/__init__.py
src/pyseekdb/client/__init__.py
tests/unit_tests/test_sparse_vector_sql.py

coderabbitai

🧹 Nitpick comments (2)

.github/workflows/ci.yml (2)

113-115: The grep chain logic is confusing and potentially fragile.

The expression grep 'passed' | grep -q 'failed' && exit 1 || exit 0 attempts to find lines containing both "passed" and "failed", but this is convoluted. With set -o pipefail, pytest's exit code should already propagate correctly.

If the intent is to catch summary lines like 5 passed, 2 failed, consider simplifying:

♻️ Suggested simplification

          set -o pipefail
-          uv run pytest tests/unit_tests/ -v --log-cli-level=${log_level} | tee pytest.log
-          tail -n 1 pytest.log | grep '=======' | grep 'passed' |grep -q 'failed' && exit 1 || exit 0
+          uv run pytest tests/unit_tests/ -v --log-cli-level=${log_level} --tb=short

Alternatively, if log capture is needed, rely on pytest's exit code directly:

          set -o pipefail
-          uv run pytest tests/unit_tests/ -v --log-cli-level=${log_level} | tee pytest.log
-          tail -n 1 pytest.log | grep '=======' | grep 'passed' |grep -q 'failed' && exit 1 || exit 0
+          uv run pytest tests/unit_tests/ -v --log-cli-level=${log_level} 2>&1 | tee pytest.log

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 113 - 115, The current CI step uses a
fragile grep chain after "set -o pipefail" and "uv run pytest ... | tee
pytest.log"—specifically the "tail -n 1 pytest.log | grep '=======' | grep
'passed' | grep -q 'failed' && exit 1 || exit 0" logic is confusing and
unnecessary; replace it by relying on pytest's exit code (remove the tail/grep
chain so the job simply exits with pytest's exit status when using set -o
pipefail), or if you must inspect the logfile, simplify to a single explicit
check like grepping pytest.log for "failed" and exiting non‑zero if found (refer
to the CI step that runs "uv run pytest ... | tee pytest.log" and the use of
"set -o pipefail").
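
To see why the last-line check is fragile, its logic can be modeled in isolation. The sketch below is a Python model of the grep chain only, not of the whole CI step; the function name and the sample summary lines are illustrative.

```python
def grep_chain_exit_code(last_line: str) -> int:
    """Model `grep '=======' | grep 'passed' | grep -q 'failed' && exit 1 || exit 0`."""
    # The chain succeeds only if every pattern matches the same line.
    matched = all(token in last_line for token in ("=======", "passed", "failed"))
    # `&& exit 1 || exit 0`: exit 1 on a full match, exit 0 otherwise.
    return 1 if matched else 0


# Sample pytest-style summary lines.
mixed = "======= 2 failed, 5 passed in 0.31s ======="
only_failures = "======= 3 failed in 0.12s ======="

print(grep_chain_exit_code(mixed))          # 1: failure detected
print(grep_chain_exit_code(only_failures))  # 0: failure missed, no "passed" on the line
```

A run in which every test fails produces no "passed" token, so this check alone would report success. Relying on pytest's own exit status avoids that case.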

124-126: Same fragile grep pattern as noted above.

This duplicates the convoluted exit-code handling from line 115. Consider applying the same simplification here.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 124 - 126, The current pipeline uses a
fragile grep chain on the last line of pytest.log after running "uv run pytest
..." (with set -o pipefail) to decide the job status; instead, capture and check
pytest's actual exit code from the pipeline (the shell PIPESTATUS entry for the
"uv run pytest" command) and exit with that non-zero code when tests fail.
Update the step that runs "uv run pytest tests/integration_tests/ -v
--log-cli-level=${log_level} -k ..." and replace the final "tail ... | grep ..."
logic with a check of the pipeline exit status (PIPESTATUS[0]) so the workflow
reliably fails when pytest fails.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35854ef and 6d0eab1.

📒 Files selected for processing (1)

.github/workflows/ci.yml

coderabbitai Bot reviewed Feb 26, 2026

hnwyllmm added 3 commits February 26, 2026 17:03

Refresh lockfile and dependency constraints for check stability.

c8f872d

Exclude yanked numpy 2.4.0 and regenerate uv.lock (including pylibseekdb upgrade) so `make check` passes consistently with --locked. Made-with: Cursor

fix test sparse vector sql

31d6b04

lazy loading SeekdbEmbeddedClient

5d9c6da

coderabbitai Bot reviewed Feb 26, 2026

Comment thread tests/unit_tests/test_sparse_vector_sql.py

hnwyllmm added 2 commits February 26, 2026 18:15

remove SeekdbEmbeddedClient from pyseekdb package

35854ef

test python3.14 on github action

6d0eab1

coderabbitai Bot reviewed Feb 26, 2026

hnwyllmm added 4 commits February 26, 2026 20:10

test action

63dab3b

Add targeted CI probe for Python 3.14 integration crash.

aab0ed8

Introduce a workflow step that incrementally includes integration test modules in collection to identify import side-effect culprits behind intermittent exit 139 in the non-db subset. Made-with: Cursor

Avoid embedded import side effects in admin integration test.

dcc4ba5

Remove direct SeekdbEmbeddedClient import from test module and infer embedded mode from runtime server metadata, preventing Python 3.14 collection-time native crashes. Made-with: Cursor

remove debug code

ec39362

hnwyllmm merged commit 14528ed into oceanbase:develop Feb 26, 2026
9 checks passed

hnwyllmm deleted the refine-config branch February 26, 2026 13:00

coderabbitai Bot mentioned this pull request Mar 5, 2026

refine: unify configuration fields and typed analyzer properties #178

Merged

2 tasks

hnwyllmm merged 10 commits into oceanbase:develop from hnwyllmm:refine-config
