Refine index configuration validation and typed helpers #173

Merged
hnwyllmm merged 10 commits into oceanbase:develop from hnwyllmm:refine-config
Feb 26, 2026

Conversation

@hnwyllmm (Member) commented Feb 26, 2026

Summary

  • add typed HNSW/fulltext configuration helpers and export them from package entry points
  • enforce stricter parameter validation and normalization for dense/sparse/fulltext index configs while preserving forward-compatible pass-through for unknown analyzers/properties
  • update vector index SQL generation to honor configured type/lib and extend unit tests for new defaults and validation behavior
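
The normalization with forward-compatible pass-through described above could look roughly like the sketch below. It is not the code from this PR: `AnalyzerSketch` is a stand-in for the `FulltextAnalyzer` enum, its string values are assumed, and warning on unknown names is an assumption about the intended behavior.

```python
import warnings
from enum import Enum


class AnalyzerSketch(str, Enum):
    """Stand-in for the PR's FulltextAnalyzer enum; the string values are assumed."""

    IK = "ik"
    SPACE = "space"
    BENG = "beng"
    NGRAM = "ngram"
    NGRAM2 = "ngram2"


def normalize_analyzer(value):
    """Return an AnalyzerSketch for known names, else the cleaned string unchanged."""
    if isinstance(value, AnalyzerSketch):
        return value
    if not isinstance(value, str) or not value.strip():
        raise ValueError("analyzer must be a non-empty string")
    cleaned = value.strip().lower()
    try:
        return AnalyzerSketch(cleaned)
    except ValueError:
        # Unknown analyzers are passed through so newer server-side analyzers keep working.
        warnings.warn(f"unknown analyzer {cleaned!r}; passing it through unchanged", stacklevel=2)
        return cleaned
```

Known names come back as enum members, while an unrecognized name such as `"custom"` is returned as a plain string after a warning.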

Test plan

  • PYTHONPATH=src pytest tests/unit_tests/test_configuration.py tests/unit_tests/test_sparse_vector_index_config.py
  • Run full unit/integration suite in CI

Made with Cursor

Summary by CodeRabbit

  • New Features

    • Added specialized fulltext configs (IK, SPACE, BENG, NGRAM, NGRAM2) and BQ/SQ variants for HNSW vector indexes.
  • Improvements

    • Stronger validation, normalization and safer defaults for index and sparse-vector configs (updated prune/refine/drop/refine_k defaults).
    • HNSW config now exposes configurable type/lib and uses those values in index generation.
  • Breaking Changes

    • Top-level export for the embedded client removed — import the client from the client module instead.
  • Chores

    • Excluded a specific numpy release in dependency constraints.

Improve developer ergonomics with dedicated HNSW/fulltext config helpers, enforce documented parameter validation with clear warnings, and align SQL generation/exports/tests with the expanded configuration model.

Made-with: Cursor
@coderabbitai (Bot) commented Feb 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Expands and tightens configuration dataclasses/enums and validation, changes SparseVector defaults, makes VECTOR INDEX SQL use HNSW config fields, lazily exposes SeekdbEmbeddedClient from client package, updates tests and numpy constraint.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Top-level exports: src/pyseekdb/__init__.py | Removed SeekdbEmbeddedClient from direct export; added re-exports for new config dataclasses and enums (IK*, SPACE, BENG, NGRAM, NGRAM2, Bq/Sq HNSW configs, etc.). |
| Client package surface: src/pyseekdb/client/__init__.py | Added new public config/type exports; added TYPE_CHECKING import and lazy SeekdbEmbeddedClient exposure via __getattr__ to defer import-time errors. |
| Client SQL generation: src/pyseekdb/client/client_base.py | VECTOR INDEX SQL now derives TYPE and LIB from hnsw_config.type / hnsw_config.lib instead of hard-coded values. |
| Configuration core: src/pyseekdb/client/configuration.py | Major refactor: introduced PrimitiveValue alias, normalization/validation helpers, new enums (HNSWIndexType, HNSWIndexLib, FulltextAnalyzer, IKMode, BQRefineType), typed FulltextIndexConfig variants (IK/SPACE/BENG/NGRAM/NGRAM2), HNSW variants (BqHNSWConfiguration, SqHNSWConfiguration), stronger SparseVectorIndexConfig defaults/validation, and a deprecated Configuration wrapper. |
| Tests (configuration): tests/unit_tests/test_configuration.py | Updated tests for new config classes, HNSW type/lib handling, analyzer property validation, and adjusted expected SQL output. |
| Tests (sparse vector): tests/unit_tests/test_sparse_vector_index_config.py, tests/unit_tests/test_sparse_vector_sql.py | Adjusted defaults/assertions for SparseVectorIndexConfig (prune/refine/drop ratios/refine_k) and relaxed one SQL assertion. |
| Integration test update: tests/integration_tests/test_admin_database_management.py | Replaced module-qualified pyseekdb.SeekdbEmbeddedClient type check with direct import from pyseekdb.client.client_seekdb_embedded. |
| Project config: pyproject.toml | Updated numpy constraint to exclude 2.4.0 (>=1.26,!=2.4.0). |
| CI workflow: .github/workflows/ci.yml | Added environment variables for tests and a Python 3.14 sanity-check step for quick runtime checks. |
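
To illustrate the client SQL generation change summarized above, here is a minimal sketch of deriving TYPE and LIB from config fields rather than hard-coding them. `HnswConfigSketch`, its defaults, and the rendered option format are assumptions for illustration and are not taken from client_base.py.

```python
from dataclasses import dataclass


@dataclass
class HnswConfigSketch:
    """Stand-in for the PR's HNSW configuration; field defaults are assumed."""

    dimension: int
    distance: str = "l2"
    type: str = "hnsw"
    lib: str = "vsag"


def vector_index_options(cfg: HnswConfigSketch) -> str:
    """Render the option list from the config instead of hard-coded TYPE/LIB values."""
    return f"DISTANCE={cfg.distance}, TYPE={cfg.type}, LIB={cfg.lib}"
```

With this shape, a quantized variant only has to set a different `type` on the config and the generated options follow it.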

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I hopped through enums and tidy keys,

I stitched HNSW types with gentle ease,
IK, Ngram, Space now line the trail,
A lazy client peeks when imports fail,
Configs bloom bright — a rabbit's happy squeal.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 32.69% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Refine index configuration validation and typed helpers' accurately summarizes the main change: adding validation, typed configuration helpers, and exporting them from package entry points. |


@coderabbitai (Bot) left a comment

🧹 Nitpick comments (1)
tests/unit_tests/test_configuration.py (1)

226-255: Consider adding deprecation warning assertion.

The Configuration class emits a DeprecationWarning when instantiated, but none of these tests assert that the warning is raised. Consider wrapping the Configuration() calls with pytest.warns(DeprecationWarning) to document and verify the expected deprecation behavior.

Example for one test case
     def test_configuration_empty(self):
         """Test Configuration with no parameters"""
-        config = Configuration()
+        with pytest.warns(DeprecationWarning, match="Configuration is deprecated"):
+            config = Configuration()
         assert config.hnsw is None
         assert config.fulltext_config is None
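
The same check can be sketched without pytest, using only the standard library. `ConfigurationSketch` below is a stand-in for the deprecated wrapper; its constructor signature and warning text are assumed from the diff above rather than copied from configuration.py.

```python
import warnings


class ConfigurationSketch:
    """Stand-in for the deprecated Configuration wrapper (signature assumed)."""

    def __init__(self, hnsw=None, fulltext_config=None):
        warnings.warn("Configuration is deprecated", DeprecationWarning, stacklevel=2)
        self.hnsw = hnsw
        self.fulltext_config = fulltext_config


def construct_and_capture():
    """Build the config while recording every warning raised during construction."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure DeprecationWarning is not filtered out
        config = ConfigurationSketch()
    return config, caught
```

In a pytest suite, `pytest.warns(DeprecationWarning)` expresses the same assertion more compactly, as the suggested diff shows.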
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/test_configuration.py` around lines 226 - 255, Each test
should assert the deprecation warning emitted by instantiating Configuration:
wrap each instantiation of Configuration in the tests
(test_configuration_with_hnsw_only, test_configuration_with_fulltext_only,
test_configuration_with_both, test_configuration_empty) with
pytest.warns(DeprecationWarning) so the tests verify the DeprecationWarning
produced by the Configuration class upon construction.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc1f203 and 2f784de.

📒 Files selected for processing (6)
  • src/pyseekdb/__init__.py
  • src/pyseekdb/client/__init__.py
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/configuration.py
  • tests/unit_tests/test_configuration.py
  • tests/unit_tests/test_sparse_vector_index_config.py

Exclude yanked numpy 2.4.0 and regenerate uv.lock (including pylibseekdb upgrade) so `make check` passes consistently with --locked.

Made-with: Cursor

@coderabbitai (Bot) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/pyseekdb/__init__.py (1)

112-147: ⚠️ Potential issue | 🟠 Major

SeekdbEmbeddedClient is exported in __all__ but not actually importable from the top-level module.

__all__ includes "SeekdbEmbeddedClient" (line 135), but this module does not import it from .client and does not define __getattr__ to resolve it lazily. This breaks the expected import path from pyseekdb import SeekdbEmbeddedClient used in integration tests.

The symbol is available in the .client submodule via its __getattr__, but the top-level module needs its own __getattr__ to delegate to it.

Proposed fix
 from .client.collection import Collection
 
 try:
     __version__ = importlib.metadata.version("pyseekdb")
 except importlib.metadata.PackageNotFoundError:
     __version__ = "0.0.1.dev1"
 
 __author__ = "OceanBase <open_oceanbase@oceanbase.com>"
+
+
+def __getattr__(name: str):
+    if name == "SeekdbEmbeddedClient":
+        from .client import SeekdbEmbeddedClient
+        return SeekdbEmbeddedClient
+    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
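
The proposed fix relies on module-level `__getattr__` (PEP 562), which Python calls only when normal attribute lookup on the module fails. The sketch below demonstrates that mechanism on a throwaway module object; `make_lazy_module` and the loader mapping are hypothetical helpers, not part of pyseekdb.

```python
import types


def make_lazy_module(name, loaders):
    """Create a module whose listed attributes are resolved on first access."""
    module = types.ModuleType(name)

    def _lazy_getattr(attr):
        # Called only when `attr` is not already in the module's namespace.
        if attr in loaders:
            value = loaders[attr]()
            setattr(module, attr, value)  # cache so the loader runs once
            return value
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # PEP 562: a `__getattr__` found in the module's namespace handles missing attributes.
    module.__getattr__ = _lazy_getattr
    return module
```

Because the loader runs on first access rather than at import time, an import-time failure in the embedded client would surface only for callers that actually request it.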
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyseekdb/__init__.py` around lines 112 - 147, The top-level __all__
exports "SeekdbEmbeddedClient" but the module doesn't expose it, causing imports
like from pyseekdb import SeekdbEmbeddedClient to fail; add a top-level
__getattr__ that delegates missing attribute lookups to the .client submodule
(or explicitly import SeekdbEmbeddedClient from .client into this module) so the
exported symbol is resolvable; reference the existing __all__ list, the symbol
SeekdbEmbeddedClient, and implement __getattr__(name) to import .client and
return getattr(client, name) (or perform a one-time import of
SeekdbEmbeddedClient) to restore the expected top-level import behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unit_tests/test_sparse_vector_sql.py`:
- Around line 55-58: The test test_no_optional_params_by_default currently
discards the result of _get_sparse_vector_index_sql(config); update it to assert
meaningful behavior by capturing the returned SQL and asserting either that it
equals the expected default SQL string or at minimum is non-empty and contains
expected tokens (e.g., "CREATE INDEX", the table/column names produced by
_make_config). Locate test_no_optional_params_by_default, call sql =
_get_sparse_vector_index_sql(config), and add assertions (e.g., assert sql
and/or assert "CREATE INDEX" in sql and/or assert the exact expected SQL) so
regressions in default SQL rendering are caught.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8f872d and 5d9c6da.

📒 Files selected for processing (3)
  • src/pyseekdb/__init__.py
  • src/pyseekdb/client/__init__.py
  • tests/unit_tests/test_sparse_vector_sql.py

@coderabbitai (Bot) left a comment

🧹 Nitpick comments (2)
.github/workflows/ci.yml (2)

113-115: The grep chain logic is confusing and potentially fragile.

The expression grep 'passed' | grep -q 'failed' && exit 1 || exit 0 attempts to find lines containing both "passed" and "failed", but this is convoluted. With set -o pipefail, pytest's exit code should already propagate correctly.

If the intent is to catch summary lines like 5 passed, 2 failed, consider simplifying:

♻️ Suggested simplification
          set -o pipefail
-          uv run pytest tests/unit_tests/ -v --log-cli-level=${log_level} | tee pytest.log
-          tail -n 1 pytest.log | grep '=======' | grep 'passed' |grep -q 'failed' && exit 1 || exit 0
+          uv run pytest tests/unit_tests/ -v --log-cli-level=${log_level} --tb=short

Alternatively, if log capture is needed, rely on pytest's exit code directly:

          set -o pipefail
-          uv run pytest tests/unit_tests/ -v --log-cli-level=${log_level} | tee pytest.log
-          tail -n 1 pytest.log | grep '=======' | grep 'passed' |grep -q 'failed' && exit 1 || exit 0
+          uv run pytest tests/unit_tests/ -v --log-cli-level=${log_level} 2>&1 | tee pytest.log
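
To see why the grep chain is a weak signal, the sketch below models only its decision logic in Python: it exits 1 when the last log line contains all three tokens and 0 otherwise. The sample summary lines are illustrative, and the model ignores the effect of `set -o pipefail` on the preceding pytest command.

```python
def grep_chain_exit_code(last_line: str) -> int:
    """Model `grep '=======' | grep 'passed' | grep -q 'failed' && exit 1 || exit 0`."""
    # Each grep keeps the line only if it contains its token, so all three must match.
    matches = all(token in last_line for token in ("=======", "passed", "failed"))
    return 1 if matches else 0
```

A run where every test fails produces a summary without the word "passed", so this chain alone would report success for it.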
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 113 - 115, The current CI step uses a
fragile grep chain after "set -o pipefail" and "uv run pytest ... | tee
pytest.log"—specifically the "tail -n 1 pytest.log | grep '=======' | grep
'passed' | grep -q 'failed' && exit 1 || exit 0" logic is confusing and
unnecessary; replace it by relying on pytest's exit code (remove the tail/grep
chain so the job simply exits with pytest's exit status when using set -o
pipefail), or if you must inspect the logfile, simplify to a single explicit
check like grepping pytest.log for "failed" and exiting non‑zero if found (refer
to the CI step that runs "uv run pytest ... | tee pytest.log" and the use of
"set -o pipefail").

124-126: Same fragile grep pattern as noted above.

This duplicates the convoluted exit-code handling from line 115. Consider applying the same simplification here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 124 - 126, The current pipeline uses a
fragile grep chain on the last line of pytest.log after running "uv run pytest
..." (with set -o pipefail) to decide the job status; instead, capture and check
pytest's actual exit code from the pipeline (the shell PIPESTATUS entry for the
"uv run pytest" command) and exit with that non-zero code when tests fail.
Update the step that runs "uv run pytest tests/integration_tests/ -v
--log-cli-level=${log_level} -k ..." and replace the final "tail ... | grep ..."
logic with a check of the pipeline exit status (PIPESTATUS[0]) so the workflow
reliably fails when pytest fails.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35854ef and 6d0eab1.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml

Introduce a workflow step that incrementally includes integration test modules in collection to identify import side-effect culprits behind intermittent exit 139 in the non-db subset.

Made-with: Cursor
Remove direct SeekdbEmbeddedClient import from test module and infer embedded mode from runtime server metadata, preventing Python 3.14 collection-time native crashes.

Made-with: Cursor
@hnwyllmm hnwyllmm merged commit 14528ed into oceanbase:develop Feb 26, 2026
9 checks passed
@hnwyllmm hnwyllmm deleted the refine-config branch February 26, 2026 13:00