Add ontology support to dataset features endpoint #262

hxrshxz wants to merge 2 commits into openml:main from
Conversation
Walkthrough

Adds ontology support for dataset features. A new function

🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Hey - I've found 1 issue, and left some high level feedback:
- In `get_feature_ontologies`, consider using `rows.mappings()` and accessing columns via `row["index"]` / `row["value"]` (or `row._mapping[...]`) instead of `row.index` and `row.value` to avoid potential conflicts with `Row.index` and make the column access more explicit and robust.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `get_feature_ontologies`, consider using `rows.mappings()` and accessing columns via `row["index"]` / `row["value"]` (or `row._mapping[...]`) instead of `row.index` and `row.value` to avoid potential conflicts with `Row.index` and make the column access more explicit and robust.
## Individual Comments
### Comment 1
<location path="src/database/datasets.py" line_range="134-143" />
<code_context>
return [Feature(**row, nominal_values=None) for row in rows.mappings()]
+def get_feature_ontologies(dataset_id: int, connection: Connection) -> dict[int, list[str]]:
+ """Return a mapping from feature index to its list of ontology URIs."""
+ rows = connection.execute(
+ text(
+ """
+ SELECT `index`, `value`
+ FROM data_feature_description
+ WHERE `did` = :dataset_id AND `description_type` = 'ontology'
+ """,
+ ),
+ parameters={"dataset_id": dataset_id},
+ )
+ ontologies: dict[int, list[str]] = {}
+ for row in rows:
+ ontologies.setdefault(row.index, []).append(row.value)
+ return ontologies
</code_context>
<issue_to_address>
**issue (bug_risk):** Accessing `row.index`/`row.value` directly on SQLAlchemy `Row` can be fragile; prefer `.mappings()` or `_mapping[...]` for robustness.
In this loop you’re iterating raw `Row` objects and relying on `row.index` / `row.value`, which can collide with `Row.index()` and other internals, and may change across SQLAlchemy versions. Use mapping access instead, e.g. iterate `rows.mappings()` and use `row["index"]` / `row["value"]`, or use `row._mapping["index"]` / `row._mapping["value"]` on the existing result to avoid these name clashes.
</issue_to_address>
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/routers/openml/datasets.py (1)
296-298: Optional: defer ontology query until after empty-feature guard.

Line 296 currently performs an extra read even when no features exist and the endpoint will raise immediately.
⚡ Small query-order optimization
- ontologies = database.datasets.get_feature_ontologies(dataset_id, expdb)
- for feature in features:
-     feature.ontology = ontologies.get(feature.index)
  if not features:
      processing_state = database.datasets.get_latest_processing_update(dataset_id, expdb)
      ...
      raise HTTPException(...)
+ ontologies = database.datasets.get_feature_ontologies(dataset_id, expdb)
+ for feature in features:
+     feature.ontology = ontologies.get(feature.index)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/routers/openml/datasets.py` around lines 296-298: the code calls `database.datasets.get_feature_ontologies(dataset_id, expdb)` before checking whether any features exist, causing an unnecessary DB read. Change the order so you first check the features collection (e.g., `if not features:` return or raise as the endpoint expects) and only call `get_feature_ontologies(dataset_id, expdb)` when features is non-empty, then iterate and set `feature.ontology = ontologies.get(feature.index)`. This uses the existing symbols `features`, `dataset_id`, `expdb`, and `database.datasets.get_feature_ontologies`.

tests/routers/openml/migration/datasets_migration_test.py (1)
225-228: Consider deduplicating legacy list-to-scalar normalization.

Line 224 and Line 228 apply the same conversion pattern; extracting a tiny helper would reduce drift in future parity tweaks.
♻️ Minimal refactor
+ def _legacy_scalar_or_list(values: list[str]) -> str | list[str]:
+     return values if len(values) > 1 else values[0]
+
  for feature in python_body:
      for key, value in list(feature.items()):
          if key == "nominal_values":
              values = feature.pop(key)
-             feature["nominal_value"] = values if len(values) > 1 else values[0]
+             feature["nominal_value"] = _legacy_scalar_or_list(values)
          elif key == "ontology":
              values = feature.pop(key)
-             feature["ontology"] = values if len(values) > 1 else values[0]
+             feature["ontology"] = _legacy_scalar_or_list(values)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/routers/openml/migration/datasets_migration_test.py` around lines 225-228: extract the repeated "list-to-scalar" normalization into a small helper (e.g., `normalize_legacy_list(value)` or `normalize_list_to_scalar`) and replace both in-place patterns that operate on `feature["ontology"]` (the block that does `values = feature.pop(key); feature["ontology"] = values if len(values) > 1 else values[0]`) with a call to that helper, so the code becomes `feature["ontology"] = normalize_legacy_list(feature.pop(key))`. Ensure the helper accepts the popped value and returns the list when its length is greater than 1, or the single element otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/database/datasets.py`:
- Around line 139-142: The SQL selecting ontology entries from
data_feature_description (WHERE `did` = :dataset_id AND `description_type` =
'ontology') lacks an ORDER BY, causing non-deterministic array ordering; update
that query (and the similar occurrence that also selects ontology rows) to
append "ORDER BY `index` ASC" (or ORDER BY `index`) so results are consistently
ordered by the `index` column—apply this change to both places that query
data_feature_description for description_type='ontology'.
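Applied to the query shown in this review's code context, the suggested ordering would look like the sketch below (the constant name `ONTOLOGY_QUERY` is illustrative; the second occurrence mentioned above is not shown in this excerpt):

```python
# Sketch of the review suggestion: append ORDER BY so the per-feature
# ontology URI lists are assembled in a deterministic order. The query text
# is otherwise copied from the diff in this review.
ONTOLOGY_QUERY = """
SELECT `index`, `value`
FROM data_feature_description
WHERE `did` = :dataset_id AND `description_type` = 'ontology'
ORDER BY `index` ASC
"""
```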
- Around line 147-148: The loop uses positional attribute access on SQLAlchemy
Row (for row in rows: ontologies.setdefault(row.index, []).append(row.value)),
but row.index resolves to the Row.index() method in SQLAlchemy 2.x; change to
mapping-style access (e.g., row['index'] and row['value'] or
row._mapping['index'] / row._mapping['value']) so the dictionary key is the
actual column value; update the code that iterates over rows and uses
ontologies.setdefault to use these mapping keys instead.
---
Nitpick comments:
In `@src/routers/openml/datasets.py`:
- Around line 296-298: The code calls
database.datasets.get_feature_ontologies(dataset_id, expdb) before checking if
there are any features, causing an unnecessary DB read; change the order so you
first check the features collection (e.g., if not features: return or raise as
the endpoint expects) and only call get_feature_ontologies(dataset_id, expdb)
when features is non-empty, then iterate and set feature.ontology =
ontologies.get(feature.index); this uses the existing symbols features,
dataset_id, expdb, and database.datasets.get_feature_ontologies.
In `@tests/routers/openml/migration/datasets_migration_test.py`:
- Around line 225-228: Extract the repeated "list-to-scalar" normalization into
a small helper (e.g., normalize_legacy_list(value) or normalize_list_to_scalar)
and replace both in-place patterns that operate on feature["ontology"] (the
block that does values = feature.pop(key); feature["ontology"] = values if
len(values) > 1 else values[0]) with a call to that helper so the code becomes
feature["ontology"] = normalize_legacy_list(feature.pop(key)); ensure the helper
accepts the popped value (either list or str) and returns the list when length >
1 or the single element otherwise.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- src/database/datasets.py
- src/routers/openml/datasets.py
- src/schemas/datasets/openml.py
- tests/routers/openml/datasets_test.py
- tests/routers/openml/migration/datasets_migration_test.py
ab592f7 to d132eb2
Closes #237
Fetches ontology URIs from `data_feature_description` and includes them in `GET /datasets/features/{dataset_id}` responses, achieving feature parity with the PHP API.