Database documentation by PGijsbers · Pull Request #320 · openml/server-api

PGijsbers · 2026-05-01T14:14:06Z

Add documentation about the different tables in the database.

coderabbitai · 2026-05-01T14:14:20Z

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

Adds comprehensive database schema documentation for OpenML’s two MySQL databases via three new Markdown pages: docs/database/index.md (overview and operational access), docs/database/openml.md (schema for user management, groups, join tables, and file metadata), and docs/database/openml_expdb.md (detailed experiment-related schema: datasets, features/qualities, tasks, flows/implementations, setups, runs/evaluations, studies, and community tables). Updates mkdocs.yml to add a top-level "Database" navigation section linking these pages.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The title 'Database documentation' directly and clearly reflects the main change: adding comprehensive documentation for database tables across three new documentation files and integrating them into the navigation. |
| Description check | ✅ Passed | The description 'Add documentation about the different tables in the database' accurately describes the changeset, which adds detailed documentation for tables in the openml and openml_expdb databases. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


sourcery-ai

Hey - I've found 11 issues, and left some high-level feedback:

- Several tables/columns are described as unused or unclear (e.g., setup_differences, algorithm_structure, community tables, legacy studies); consider grouping these into an explicit “deprecated/unclear” subsection with a short, consistent note so readers can quickly distinguish actively used schema from legacy parts.
- Some fields are documented with uncertainty (e.g., setup_differences.differences, implementation.algorithm_structure, evaluation_sample API accessibility); it would be helpful to either add a brief hypothesis of their historical intent or mark them with a clear TODO so future maintainers know what still needs investigation.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- Several tables/columns are described as unused or unclear (e.g., `setup_differences`, `algorithm_structure`, community tables, legacy studies); consider grouping these into an explicit “deprecated/unclear” subsection with a short, consistent note so readers can quickly distinguish actively used schema from legacy parts.
- Some fields are documented with uncertainty (e.g., `setup_differences.differences`, `implementation.algorithm_structure`, `evaluation_sample` API accessibility); it would be helpful to either add a brief hypothesis of their historical intent or mark them with a clear TODO so future maintainers know what still needs investigation.

## Individual Comments

### Comment 1
<location path="docs/database/openml_expdb.md" line_range="102" />
<code_context>
+Assigns topic labels to datasets.
+Topics are displayed as tags on the web page.
+This is the result of an experiment in 2021 to try categorize datasets to better facilitate search.
+Topics have been added by automated analysis.
+
+| Column | Type | Optional | Default | References | Description | Example |
</code_context>
<issue_to_address>
**issue (typo):** Fix grammar in "try categorize datasets"

Change "to try categorize datasets" to "to try to categorize datasets" for correct grammar.

```suggestion
This is the result of an experiment in 2021 to try to categorize datasets to better facilitate search.
```
</issue_to_address>

### Comment 2
<location path="docs/database/openml_expdb.md" line_range="157-158" />
<code_context>
+
+User-provided descriptions and ontology annotations for individual data features.
+The table is empty in production (as of 2026-04-29).
+This feature was added recently (2025ish) should be considered experimental.
+
+| Column | Type | Optional | Default | References | Description | Example |
</code_context>
<issue_to_address>
**issue (typo):** Add missing conjunction in data_feature_description sentence

The sentence is grammatically off. Suggest: "This feature was added recently (2025ish) and should be considered experimental."

```suggestion
The table is empty in production (as of 2026-04-29).
This feature was added recently (2025ish) and should be considered experimental.
```
</issue_to_address>

### Comment 3
<location path="docs/database/openml_expdb.md" line_range="164" />
<code_context>
+| Column | Type | Optional | Default | References | Description | Example |
+|--------|------|----------|---------|------------|-------------|---------|
+| did | int unsigned | No | | [data_feature(did, index)](#data_feature) | Dataset ID. | 2 |
+| index | int unsigned | No | | [data_feature(did, index)](#data_feature) | Feature index (column position, 0 indexed). | 0 |
+| uploader | mediumint unsigned | No | | | User who added the description. | 1548 |
+| date | timestamp | No | CURRENT_TIMESTAMP | | When the description was added. | 2025-12-12 09:29:30 |
</code_context>
<issue_to_address>
**nitpick (typo):** Hyphenate "0-indexed" for consistency

Please change "0 indexed" to "0-indexed" to align with the rest of the document and common usage.

```suggestion
| index | int unsigned | No | | [data_feature(did, index)](#data_feature) | Feature index (column position, 0-indexed). | 0 |
```
</issue_to_address>

### Comment 4
<location path="docs/database/openml_expdb.md" line_range="251" />
<code_context>
+So there are contraints from the `task_inputs` table to multiple other tables (`dataset`, `estimation_procedure`, ...) that are not explicitly present in the database.
</code_context>
<issue_to_address>
**issue (typo):** Correct spelling of "constraints"

```suggestion
So there are constraints from the `task_inputs` table to multiple other tables (`dataset`, `estimation_procedure`, ...) that are not explicitly present in the database.
```
</issue_to_address>
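
Because these references are not declared in the database, any integrity check has to live in application code. The following is a minimal sketch of such a check over in-memory rows. The mapping from `task_inputs.input` names to tables, the function name, and the sample rows (including the `target_feature` input) are assumptions made for illustration; they are not taken from the OpenML code base.

```python
# Assumed mapping from a `task_inputs.input` name to the table its value
# points at. Only `dataset` and `estimation_procedure` are named in the
# documentation under review; the keys are illustrative.
IMPLICIT_REFERENCES = {
    "source_data": "dataset",
    "estimation_procedure": "estimation_procedure",
}


def find_dangling_inputs(task_inputs, existing_ids):
    """Return task_inputs rows whose value points at a missing row.

    task_inputs: iterable of (task_id, input_name, value) tuples.
    existing_ids: dict mapping a table name to the set of IDs present in it.
    """
    dangling = []
    for task_id, input_name, value in task_inputs:
        table = IMPLICIT_REFERENCES.get(input_name)
        if table is None:
            continue  # this input is not treated as a reference
        if int(value) not in existing_ids[table]:
            dangling.append((task_id, input_name, value))
    return dangling


# Hypothetical sample data.
rows = [
    (1, "source_data", "2"),
    (1, "estimation_procedure", "1"),
    (2, "source_data", "999"),
    (2, "target_feature", "class"),
]
ids = {"dataset": {2}, "estimation_procedure": {1}}
print(find_dangling_inputs(rows, ids))
```

With the sample data, only the row for task 2 pointing at dataset 999 is reported, since no dataset with that ID exists in `ids`.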

### Comment 5
<location path="docs/database/openml_expdb.md" line_range="300" />
<code_context>
+Production uses the following directives:
+
+ - `[INPUT:source_data]`: look up the value of `task_inputs.value` where `input="source_data"` and `task_id` matches the task.
+ - `[TASK:ttid]`: look up the valud of `task.ttid` for that task.
+
+The `template_api` contains XML instead:
</code_context>
<issue_to_address>
**issue (typo):** Correct spelling of "value"

In the second bullet, change `valud` to `value`.

```suggestion
 - `[TASK:ttid]`: look up the value of `task.ttid` for that task.
```
</issue_to_address>
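
The two directives quoted in the context above can be read as placeholder substitution. The sketch below illustrates that reading with a regular expression and in-memory stand-ins for the `task` and `task_inputs` tables. The template string, the dictionaries, and the function name are hypothetical; only the lookup rules for `[INPUT:...]` and `[TASK:...]` come from the quoted documentation.

```python
import re

# Hypothetical in-memory stand-ins for the `task` and `task_inputs` tables.
TASK = {1: {"ttid": 1}}
TASK_INPUTS = {(1, "source_data"): "2"}

# Matches directives such as [INPUT:source_data] or [TASK:ttid].
DIRECTIVE = re.compile(r"\[(INPUT|TASK):(\w+)\]")


def resolve_directives(template, task_id):
    """Replace each directive in `template` with the value for `task_id`."""

    def lookup(match):
        kind, name = match.group(1), match.group(2)
        if kind == "INPUT":
            # task_inputs.value where input == name and task_id matches
            return str(TASK_INPUTS[(task_id, name)])
        # TASK: the named column of the task row itself
        return str(TASK[task_id][name])

    return DIRECTIVE.sub(lookup, template)


# Illustrative template; the real `api_constraints` content may differ.
print(resolve_directives("did=[INPUT:source_data]&ttid=[TASK:ttid]", 1))
```

Passing a function to `re.sub` keeps the lookup logic in one place, so another directive kind would only need an extra branch in `lookup`.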

### Comment 6
<location path="docs/database/openml_expdb.md" line_range="417" />
<code_context>
+| Column | Type | Optional | Default | References | Description | Example |
+|--------|------|----------|---------|------------|-------------|---------|
+| id | int | No | auto_increment | | Primary key (flow ID). |  1 |
+| fullName | varchar(1024) | No | | | Full qualified name (name + version). | sklearn.tree.DecisionTreeClassifier(1) |
+| uploader | mediumint unsigned | Yes | NULL | [openml.users.id](openml.md#users) | User who uploaded the flow. | 2 |
+| name | varchar(1024) | No | | | Flow name. | sklearn.tree.DecisionTreeClassifier |
</code_context>
<issue_to_address>
**issue (typo):** Use "Fully qualified" instead of "Full qualified"

In the `fullName` description, change "Full qualified name" to "Fully qualified name".

```suggestion
| fullName | varchar(1024) | No | | | Fully qualified name (name + version). | sklearn.tree.DecisionTreeClassifier(1) |
```
</issue_to_address>

### Comment 7
<location path="docs/database/openml_expdb.md" line_range="625" />
<code_context>
+|--------|------|----------|---------|------------|-------------|---------|
+| id | int | No | auto_increment | | Primary key. | 1 |
+| name | varchar(64) | No | | | Metric name (unique). | EuclidianDistance |
+| functionType | varchar(128) | No | 'EvaluationFunction' | | Type of function. One of Metric, KernelFuction, or EvaluationFuction | Metric |
+| min | varchar(64) | No | | | Minimum possible value. | 0 |
+| max | varchar(64) | No | | | Maximum possible value. | '' |
</code_context>
<issue_to_address>
**issue (typo):** Correct misspelling of "Function" in listed types

In the `functionType` description, correct the misspellings "KernelFuction" and "EvaluationFuction" to "KernelFunction" and "EvaluationFunction".

```suggestion
| functionType | varchar(128) | No | 'EvaluationFunction' | | Type of function. One of Metric, KernelFunction, or EvaluationFunction | Metric |
```
</issue_to_address>

### Comment 8
<location path="docs/database/openml_expdb.md" line_range="698" />
<code_context>
+## Studies
+
+Studies have historically been in flux, but they are generally collections of objects (e.g., tasks).
+One mayor change during OpenML's lifetime was in how those collections were defined, which happened around 2019.
+Nowadays there are dedicated tables (e.g., study_task) to make the connection between a study and its task.
+Historically, what is now referred to as a "legacy study", this association was achieved through tags.
</code_context>
<issue_to_address>
**issue (typo):** Correct "mayor" to "major"

In the sentence starting with "One mayor change", replace "mayor" with "major".

```suggestion
One major change during OpenML's lifetime was in how those collections were defined, which happened around 2019.
```
</issue_to_address>

### Comment 9
<location path="docs/database/openml_expdb.md" line_range="702" />
<code_context>
+Nowadays there are dedicated tables (e.g., study_task) to make the connection between a study and its task.
+Historically, what is now referred to as a "legacy study", this association was achieved through tags.
+E.g., all tasks with tag `study_14` would be considered part of `study_14`.
+Some studies are still defined this way, and migration of these studies to new-style studies are planned.
+The Python-based REST API will not support legacy style studies (hence the data migration needs to occur to make the legacy studies usable with the new API).
+
</code_context>
<issue_to_address>
**issue (typo):** Fix subject–verb agreement in migration sentence

Change “migration of these studies to new-style studies are planned” to either “migration … is planned” or “migrations … are planned” for correct subject–verb agreement.

```suggestion
Some studies are still defined this way, and migration of these studies to new-style studies is planned.
```
</issue_to_address>
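
The tag-based membership described in the context above can be illustrated with a short sketch: tasks carrying a `study_<id>` tag are grouped by study ID and turned into rows shaped like a `(study_id, task_id)` association. The input format, the function names, and the row shape are assumptions for illustration; this is not the project's planned migration.

```python
import re

# A legacy study tag is assumed to look like `study_14`.
LEGACY_TAG = re.compile(r"^study_(\d+)$")


def legacy_study_members(task_tags):
    """Group task IDs by the study ID encoded in a `study_<id>` tag.

    task_tags: iterable of (task_id, tag) pairs.
    """
    members = {}
    for task_id, tag in task_tags:
        match = LEGACY_TAG.match(tag)
        if match:
            members.setdefault(int(match.group(1)), set()).add(task_id)
    return members


def to_study_task_rows(members):
    """Flatten the grouping into sorted (study_id, task_id) rows."""
    return sorted(
        (study_id, task_id)
        for study_id, task_ids in members.items()
        for task_id in task_ids
    )


# Hypothetical sample data: tags that do not match the pattern are ignored.
tags = [(3, "study_14"), (7, "study_14"), (7, "OpenML100"), (9, "study_1")]
print(to_study_task_rows(legacy_study_members(tags)))
```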

### Comment 10
<location path="docs/database/openml.md" line_range="38" />
<code_context>
+| core | enum('true','false') | No | 'false' | | Whether the user is a core team member. | false |
+| external_source | varchar(50) | Yes | NULL | | External authentication provider (e.g., OAuth). not in use | 0000 |
+| external_id | varchar(50) | Yes | NULL | | User ID from external authentication provider. not in use |  0000 |
+| session_hash | varchar(40) | Yes | NULL | | Hash for API session authentication. 32 digit hexidecimal | - |
+| session_hash_date | timestamp | Yes | CURRENT_TIMESTAMP | | When the session hash was last generated. | 2024-10-20 20:18:54 |
+| gamification_visibility | varchar(32) | No | 'show' | | Visibility setting for gamification badges. One of 'show' or 'hidden' | hidden |
</code_context>
<issue_to_address>
**issue (typo):** Correct spelling of "hexadecimal"

Typo in the description: "hexidecimal" should be "hexadecimal".

```suggestion
| session_hash | varchar(40) | Yes | NULL | | Hash for API session authentication. 32 digit hexadecimal | - |
```
</issue_to_address>

### Comment 11
<location path="docs/database/index.md" line_range="15" />
<code_context>
+When launching the services as described in ["Installation"](../installation.md), you can access the mysql server with both databases using `docker compose exec database mysql -uroot -pok`.
+
+## Why we use queries instead of an ORM tool
+There are a two main reasons why we do not use an ORM tool *yet*.
+
+First, we want to keep queries close to the original PHP implementation.
</code_context>
<issue_to_address>
**issue (typo):** Remove extra article in "a two main reasons"

Specifically, update the sentence to: "There are two main reasons why we do not use an ORM tool *yet*."

```suggestion
There are two main reasons why we do not use an ORM tool *yet*.
```
</issue_to_address>


coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (1)

docs/database/index.md (1)
10-10: ⚡ Quick win

Minor wording cleanup for readability around the ORM rationale.

Static analysis flags Line 10 as needing a shorter/rephrased sentence (it currently reads as part of the preceding paragraph and flows into “When launching…”).

Consider breaking it into two sentences or adding a short line break to improve readability.
✏️ Example reflow
-The plan is to revise the database schema after we sunset the PHP API, to avoid having to make changes to two APIs.
+The plan is to revise the database schema after we sunset the PHP API.
+This helps avoid having to make changes to two APIs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/database/index.md` at line 10, Split or reflow the sentence "The plan is
to revise the database schema after we sunset the PHP API, to avoid having to
make changes to two APIs." in docs/database/index.md so it reads as a standalone
short sentence or two—e.g., end the prior paragraph, make this its own sentence,
or rewrite to something like "We will revise the database schema after
sunsetting the PHP API to avoid modifying two APIs."—so it no longer runs on
with the previous paragraph and improves readability.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/database/openml_expdb.md`:
- Around line 524-528: The phrase "out of use" is wordy in the docs; update
occurrences in the setup_differences section (heading "setup_differences") and
the schedule note (heading or paragraph referencing "schedule") to a shorter
term such as "unused" or "no longer used" (pick one consistent term across the
file), replacing the exact phrase "out of use" wherever it appears (including
the lines around the setup differences note and the schedule note) so the
wording is concise and uniform.
- Around line 288-328: The MD046 warnings come from fenced code blocks in the
examples for api_constraints, template_api, and template_search; fix by
converting each fenced block (the JSON block under "The `api_constraints`
contains JSON...", the XML block under "The `template_api` contains XML...", and
the JSON block under "Finally, the `template_search` contains JSON...") into
indented code blocks (indent every line by four spaces and ensure a blank line
before the block) so markdownlint accepts them, or alternatively update the
repository's markdownlint config to allow fenced blocks if that is the intended
style.
- Around line 407-409: The phrase "the database uses historical names
'implementation' for a flow and 'algorithm' for ??" leaves an unresolved "??";
locate the sentence containing 'implementation' and 'algorithm' and either (A)
replace "??" with the precise DB/PHP mapping for "algorithm" (e.g., the actual
column/table/class name used in the schema/PHP code) if you can confirm it, or
(B) rewrite the sentence to remove the unknown mapping—e.g. "the database uses
historical names: 'implementation' for a flow and 'algorithm' for the algorithm
entry"—so it no longer implies an unknown mapping; ensure you update the wording
around the symbols 'implementation' and 'algorithm' to be clear and consistent
with the DB/PHP naming.
- Around line 298-301: Fix the typo in the documentation directive for
`[TASK:ttid]`: replace the misspelled word "valud" with "value" in the sentence
that reads `look up the valud of task.ttid` so it correctly states `look up the
value of task.ttid` (update the `[TASK:ttid]` line in openml_expdb.md).
- Around line 699-704: Replace the unhyphenated phrase "legacy style studies"
with "legacy-style studies" in the discussed paragraph (e.g., update the
sentence "The Python-based REST API will not support legacy style studies" and
any other occurrences in this section) so the compound adjective is properly
hyphenated; keep surrounding wording unchanged.
- Around line 235-254: Replace the two typos in the task-structure narrative:
change "custom testset" to "custom test set" and change "contraints" to
"constraints" so the phrasing is correct (these appear in the paragraphs
describing optional task inputs and the constraints from task_inputs to other
tables).

In `@docs/database/openml.md`:
- Line 38: Update the table row for session_hash to fix formatting and spelling:
change the description text "32 digit hexidecimal" to "32-digit hexadecimal" so
the field reads "Hash for API session authentication. 32-digit hexadecimal";
locate the row containing the symbol "session_hash" and replace the phrase
accordingly.
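
For the MD046 item above, the fenced-to-indented conversion can be sketched as a small text transformation. This is a simplified illustration: it handles only backtick fences, drops the language label on the opening fence, and does not handle nested fences. The function name is hypothetical.

```python
def fenced_to_indented(markdown):
    """Convert backtick-fenced code blocks to four-space indented blocks."""
    out = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            # Indented code blocks need a blank line before them.
            if not in_fence and out and out[-1].strip():
                out.append("")
            in_fence = not in_fence
            continue  # the fence line itself is dropped
        # Indent code lines; rstrip keeps blank code lines free of spaces.
        out.append(("    " + line).rstrip() if in_fence else line)
    return "\n".join(out)


source = 'Example:\n```json\n{"a": 1}\n```\nDone.'
print(fenced_to_indented(source))
```

Since the indented form loses the language label, and with it syntax highlighting, updating the markdownlint configuration may be the better choice if fenced blocks are the intended style.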



ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 12302f09-8241-47d3-94ae-31955176c5ce

📥 Commits

Reviewing files that changed from the base of the PR and between 2c377fd and 6bf8bbf.

📒 Files selected for processing (4)

docs/database/index.md
docs/database/openml.md
docs/database/openml_expdb.md
mkdocs.yml

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> Co-authored-by: Pieter Gijsbers <p.gijsbers@tue.nl>

PGijsbers added 4 commits April 30, 2026 16:22

Add initial descriptions of the database

c14060a

Annotation of openml database tables

59bfaff

Add explanation on absense of ORM

6943009

Fix formatting, clarify TODOs

6bf8bbf

PGijsbers added the documentation Improvements or additions to documentation label May 1, 2026

sourcery-ai Bot reviewed May 1, 2026

coderabbitai Bot reviewed May 1, 2026

PGijsbers commented May 2, 2026

Comment thread docs/database/openml_expdb.md Outdated

PGijsbers commented May 2, 2026

Comment thread docs/database/openml_expdb.md Outdated

Fix typos

c0e3b20

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> Co-authored-by: Pieter Gijsbers <p.gijsbers@tue.nl>

PGijsbers merged commit fe55bd5 into main May 2, 2026
2 of 3 checks passed

PGijsbers deleted the docs-database branch May 2, 2026 08:30
