
Add ontology support to dataset features endpoint #262

Open
hxrshxz wants to merge 2 commits into openml:main from hxrshxz:fix/237-ontology-support

Conversation


@hxrshxz hxrshxz commented Feb 28, 2026

Closes #237

Fetches ontology URIs from data_feature_description and includes them in GET /datasets/features/{dataset_id} responses, achieving feature parity with the PHP API.


coderabbitai bot commented Feb 28, 2026

Warning

Rate limit exceeded

@hxrshxz has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 15 minutes and 13 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between ab592f7 and d132eb2.

📒 Files selected for processing (5)
  • src/database/datasets.py
  • src/routers/openml/datasets.py
  • src/schemas/datasets/openml.py
  • tests/routers/openml/datasets_test.py
  • tests/routers/openml/migration/datasets_migration_test.py

Walkthrough

Adds ontology support for dataset features. A new function get_feature_ontologies(dataset_id, connection) queries the data_feature_description table for rows with description_type = 'ontology' and returns a mapping from feature index to a list of ontology URIs. The OpenML datasets router fetches these ontologies and assigns them to feature.ontology. The Feature schema gains an optional ontology: list[str] | None field. Tests were added and updated to verify ontology presence, absence, and migration handling.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 12.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)
  • Title check — ✅ Passed: the pull request title clearly and concisely summarizes the main change: adding ontology support to the dataset features endpoint.
  • Description check — ✅ Passed: the description is relevant and explains that it closes issue #237 by fetching ontology URIs and including them in API responses for feature parity with PHP.
  • Linked Issues check — ✅ Passed: the PR meets the objectives from issue #237 by implementing ontology support in the dataset features endpoint, matching the PHP implementation requirements.
  • Out of Scope Changes check — ✅ Passed: all changes are scoped to implementing ontology support for dataset features; no out-of-scope modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment



@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue and left some high-level feedback:

  • In get_feature_ontologies, consider using rows.mappings() and accessing columns via row["index"] / row["value"] (or row._mapping[...]) instead of row.index and row.value to avoid potential conflicts with Row.index and make the column access more explicit and robust.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `get_feature_ontologies`, consider using `rows.mappings()` and accessing columns via `row["index"]` / `row["value"]` (or `row._mapping[...]`) instead of `row.index` and `row.value` to avoid potential conflicts with `Row.index` and make the column access more explicit and robust.

## Individual Comments

### Comment 1
<location path="src/database/datasets.py" line_range="134-143" />
<code_context>
     return [Feature(**row, nominal_values=None) for row in rows.mappings()]


+def get_feature_ontologies(dataset_id: int, connection: Connection) -> dict[int, list[str]]:
+    """Return a mapping from feature index to its list of ontology URIs."""
+    rows = connection.execute(
+        text(
+            """
+            SELECT `index`, `value`
+            FROM data_feature_description
+            WHERE `did` = :dataset_id AND `description_type` = 'ontology'
+            """,
+        ),
+        parameters={"dataset_id": dataset_id},
+    )
+    ontologies: dict[int, list[str]] = {}
+    for row in rows:
+        ontologies.setdefault(row.index, []).append(row.value)
+    return ontologies
</code_context>
<issue_to_address>
**issue (bug_risk):** Accessing `row.index`/`row.value` directly on SQLAlchemy `Row` can be fragile; prefer `.mappings()` or `_mapping[...]` for robustness.

In this loop you’re iterating raw `Row` objects and relying on `row.index` / `row.value`, which can collide with `Row.index()` and other internals, and may change across SQLAlchemy versions. Use mapping access instead, e.g. iterate `rows.mappings()` and use `row["index"]` / `row["value"]`, or use `row._mapping["index"]` / `row._mapping["value"]` on the existing result to avoid these name clashes.
</issue_to_address>
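The mapping-style access both reviewers suggest can be illustrated without a database. The dicts below stand in for the RowMapping objects yielded by rows.mappings(); the column names match the query in the code context above:

```python
# Each dict stands in for one RowMapping yielded by rows.mappings();
# the column names ("index", "value") come from the SELECT above.
rows = [
    {"index": 0, "value": "http://example.org/onto/a"},
    {"index": 0, "value": "http://example.org/onto/b"},
    {"index": 2, "value": "http://example.org/onto/c"},
]

ontologies: dict[int, list[str]] = {}
for row in rows:
    # row["index"] is unambiguous; attribute access (row.index) on a
    # SQLAlchemy Row can collide with the tuple method Row.index().
    ontologies.setdefault(row["index"], []).append(row["value"])

print(ontologies)
# {0: ['http://example.org/onto/a', 'http://example.org/onto/b'],
#  2: ['http://example.org/onto/c']}
```

With the real result object, the same loop works by iterating rows.mappings() instead of the list of dicts; the grouping logic is unchanged.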



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
src/routers/openml/datasets.py (1)

296-298: Optional: defer the ontology query until after the empty-feature guard.

Line 296 currently performs an extra read even when no features exist, in which case the endpoint raises immediately anyway.

⚡ Small query-order optimization
-    ontologies = database.datasets.get_feature_ontologies(dataset_id, expdb)
-    for feature in features:
-        feature.ontology = ontologies.get(feature.index)
-
     if not features:
         processing_state = database.datasets.get_latest_processing_update(dataset_id, expdb)
         ...
         raise HTTPException(...)

+    ontologies = database.datasets.get_feature_ontologies(dataset_id, expdb)
+    for feature in features:
+        feature.ontology = ontologies.get(feature.index)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 296 - 298, The code calls
database.datasets.get_feature_ontologies(dataset_id, expdb) before checking if
there are any features, causing an unnecessary DB read; change the order so you
first check the features collection (e.g., if not features: return or raise as
the endpoint expects) and only call get_feature_ontologies(dataset_id, expdb)
when features is non-empty, then iterate and set feature.ontology =
ontologies.get(feature.index); this uses the existing symbols features,
dataset_id, expdb, and database.datasets.get_feature_ontologies.
tests/routers/openml/migration/datasets_migration_test.py (1)

225-228: Consider deduplicating legacy list-to-scalar normalization.

Line 224 and Line 228 apply the same conversion pattern; extracting a tiny helper would reduce drift in future parity tweaks.

♻️ Minimal refactor
+    def _legacy_scalar_or_list(values: list[str]) -> str | list[str]:
+        return values if len(values) > 1 else values[0]
+
     for feature in python_body:
         for key, value in list(feature.items()):
             if key == "nominal_values":
                 values = feature.pop(key)
-                feature["nominal_value"] = values if len(values) > 1 else values[0]
+                feature["nominal_value"] = _legacy_scalar_or_list(values)
             elif key == "ontology":
                 values = feature.pop(key)
-                feature["ontology"] = values if len(values) > 1 else values[0]
+                feature["ontology"] = _legacy_scalar_or_list(values)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/migration/datasets_migration_test.py` around lines 225 -
228, Extract the repeated "list-to-scalar" normalization into a small helper
(e.g., normalize_legacy_list(value) or normalize_list_to_scalar) and replace
both in-place patterns that operate on feature["ontology"] (the block that does
values = feature.pop(key); feature["ontology"] = values if len(values) > 1 else
values[0]) with a call to that helper so the code becomes feature["ontology"] =
normalize_legacy_list(feature.pop(key)); ensure the helper accepts the popped
value (either list or str) and returns the list when length > 1 or the single
element otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/database/datasets.py`:
- Around line 139-142: The SQL selecting ontology entries from
data_feature_description (WHERE `did` = :dataset_id AND `description_type` =
'ontology') lacks an ORDER BY, causing non-deterministic array ordering; update
that query (and the similar occurrence that also selects ontology rows) to
append "ORDER BY `index` ASC" (or ORDER BY `index`) so results are consistently
ordered by the `index` column—apply this change to both places that query
data_feature_description for description_type='ontology'.
- Around line 147-148: The loop uses positional attribute access on SQLAlchemy
Row (for row in rows: ontologies.setdefault(row.index, []).append(row.value)),
but row.index resolves to the Row.index() method in SQLAlchemy 2.x; change to
mapping-style access (e.g., row['index'] and row['value'] or
row._mapping['index'] / row._mapping['value']) so the dictionary key is the
actual column value; update the code that iterates over rows and uses
ontologies.setdefault to use these mapping keys instead.


ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72989df and 5280c6a.

📒 Files selected for processing (5)
  • src/database/datasets.py
  • src/routers/openml/datasets.py
  • src/schemas/datasets/openml.py
  • tests/routers/openml/datasets_test.py
  • tests/routers/openml/migration/datasets_migration_test.py

hxrshxz force-pushed the fix/237-ontology-support branch from ab592f7 to d132eb2 on February 28, 2026 at 17:58


Development

Successfully merging this pull request may close these issues.

Add ontology support

1 participant