diff --git a/docs/database/index.md b/docs/database/index.md new file mode 100644 index 00000000..f1f87ec4 --- /dev/null +++ b/docs/database/index.md @@ -0,0 +1,27 @@ +# Database Schema + +The OpenML server uses two MySQL databases: + +- **[openml](openml.md)**: Core platform database for user accounts, file storage, access control, and forum threads. +- **[openml_expdb](openml_expdb.md)**: Experiment database storing datasets, tasks, flows (implementations), runs, evaluations, and studies. + +These documentation pages describe their current schemas. +Several tables are no longer in use; these are mentioned but not described. +The plan is to revise the database schema after we sunset the PHP API, to avoid having to make changes to two APIs. + +When launching the services as described in ["Installation"](../installation.md), you can access the MySQL server with both databases using `docker compose exec database mysql -uroot -pok`. + +## Why we use queries instead of an ORM tool +There are two main reasons why we do not use an ORM tool *yet*. + +First, we want to keep queries close to the original PHP implementation. +Not using an ORM makes it natural to stick with queries similar or identical to those the PHP API uses. +Introducing an ORM, and having it construct queries for us, adds an extra layer of changes. +Performance issues may also be harder to track down if they arise. + +Second, we will likely revise the database schema significantly after the PHP API is sunset. +The current schema is over a decade old and contains design decisions based on expected future use or features +and other decisions which we may want to revise based on our experience running OpenML. +It seems easier to support those changes when we do not yet use an ORM tool. + +The expectation is that we will move to using an ORM tool once the schema is revised and stable. 
diff --git a/docs/database/openml.md b/docs/database/openml.md new file mode 100644 index 00000000..994b6be0 --- /dev/null +++ b/docs/database/openml.md @@ -0,0 +1,85 @@ +# openml Database + +The `openml` database contains core platform tables for user management, file storage, access control, and community features. + +## users + +Stores registered user accounts and their authentication details. +For a little while, the `username` and `email` were synonymous. + + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | mediumint unsigned | No | auto_increment | | Primary key. | 2 | +| ip_address | varchar(45) | No | | | IP address at registration. | 127.0.0.1 | +| username | varchar(100) | No | | | Unique login name. | foo@bar.com | +| password | varchar(255) | No | | | Hashed password. | - | +| email | varchar(254) | No | | | Email address. | foo@bar.com | +| activation_selector | varchar(255) | Yes | NULL | | Selector token for account activation. | | +| activation_code | varchar(255) | Yes | NULL | | Code for account activation. | | +| forgotten_password_selector | varchar(255) | Yes | NULL | | Selector token for password reset. | | +| forgotten_password_code | varchar(255) | Yes | NULL | | Code for password reset. | | +| forgotten_password_time | int unsigned | Yes | NULL | | Timestamp of password reset request. | | +| remember_selector | varchar(255) | Yes | NULL | | Selector token for "remember me" sessions. | | +| remember_code | varchar(255) | Yes | NULL | | Code for "remember me" sessions. | | +| created_on | int unsigned | No | | | Unix timestamp of account creation. | 1363880450 | +| last_login | int unsigned | Yes | NULL | | Unix timestamp of last login. | 1763344931 | +| active | tinyint unsigned | Yes | NULL | | Whether the account is activated through the confirmation email, 1 or 0. 
| 1 | +| first_name | varchar(50) | Yes | NULL | | User's first name. | Joaquin | +| last_name | varchar(50) | Yes | NULL | | User's last name. | van Rijn | +| company | varchar(100) | No | | | Organization or affiliation. | OpenML | +| phone | varchar(20) | Yes | NULL | | Phone number. Not in use. | 0000 | +| country | varchar(50) | No | | | Country of residence. No input validation was done. | rfr | +| image | varchar(128) | Yes | NULL | | Path to profile image. | https://www.openml.org/data//view/21794253/joa.jpeg | +| bio | text | No | | | User biography. | "My wonderful bio" | +| core | enum('true','false') | No | 'false' | | Whether the user is a core team member. | false | +| external_source | varchar(50) | Yes | NULL | | External authentication provider (e.g., OAuth). not in use | 0000 | +| external_id | varchar(50) | Yes | NULL | | User ID from external authentication provider. not in use | 0000 | +| session_hash | varchar(40) | Yes | NULL | | Hash for API session authentication. 32 digit hexadecimal | - | +| session_hash_date | timestamp | Yes | CURRENT_TIMESTAMP | | When the session hash was last generated. | 2024-10-20 20:18:54 | +| gamification_visibility | varchar(32) | No | 'show' | | Visibility setting for gamification badges. One of 'show' or 'hidden' | hidden | + +## groups + +Defines user groups for role-based access control. +Currently the database recognizes three groups: admins, normal users, and read-only users. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | mediumint unsigned | No | auto_increment | | Primary key. | 2 | +| name | varchar(20) | No | | | Group name. | members | +| description | varchar(100) | No | | | Description of the group's purpose. | normal read-write permissions | + +## users_groups + +Associates users with groups (many-to-many relationship). 
+ +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | mediumint unsigned | No | auto_increment | | Primary key. | 2 | +| user_id | mediumint unsigned | No | | [users.id](#users) | The user. | 2 | +| group_id | mediumint unsigned | No | | [groups.id](#groups) | The group the user belongs to. | 2 | + +## file + +Stores metadata about uploaded files (datasets, flows, predictions, etc.). + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int | No | auto_increment | | Primary key. | 1 | +| creator | int | No | | | User ID of the uploader. | 2 | +| creation_date | datetime | No | | | When the file was uploaded. | 2015-11-30 06:48:32 | +| filepath | varchar(256) | No | | | Storage path on the server. | dataset/api/dataset_1_anneal.arff | +| filesize | int | No | | | File size in bytes. | 143338 | +| filename_original | varchar(256) | No | | | Original filename as uploaded. | dataset_1_anneal.arff | +| extension | varchar(16) | No | | | File extension (e.g., arff, csv). | arff | +| mime_type | varchar(32) | No | | | MIME type of the file. | application/octet-stream | +| md5_hash | varchar(64) | No | | | MD5 checksum for integrity verification. | 43b29a3eb09e8fac9a8525c3c83abec8 | +| type | enum('dataset','implementation','predictions','userimage','run_trace','run_uploaded_file','url','misc') | No | | | Category of the file. | dataset | +| access_policy | enum('public','private','none','deleted') | No | 'public' | | Access control policy for the file. | public | + + +## deprecated tables +There are also `category` and `thread` tables which were designed for a forum feature but are not used. +The `meta_dataset` table is for requesting automated metadata set building, a feature which is not enabled. 
+The `access` table is not used; access constraints are currently handled by columns in the respective tables (e.g., `dataset.visibility` for datasets). diff --git a/docs/database/openml_expdb.md b/docs/database/openml_expdb.md new file mode 100644 index 00000000..f0f82aba --- /dev/null +++ b/docs/database/openml_expdb.md @@ -0,0 +1,851 @@ +# openml_expdb Database + +The `openml_expdb` database contains all experiment-related data: datasets, tasks, flows (implementations), runs, evaluations, and studies. +The "expdb" part stands for "experiment database", the name used in [Joaquin Vanschoren's thesis](https://research.kuleuven.be/portal/en/project/3E061119). + +Some remarks which apply generally: + + - `datetime` fields use the format `YYYY-MM-DD hh:mm:ss`. + - Some `varchar` columns only hold a very limited set of values in production. If the description says "one of ...", it means only those values are present in the production database. + +There are a few tables which never were or no longer are in use. +They will be removed and so are not documented here. They are: `data_quality_interval`, `feature_quality`, `output_data` (only present on the test server), `algorithm` and `algorithm_quality`. +For more information, see [server-api#165](https://github.com/openml/server-api/issues/165). + +--- + +## Datasets + +### dataset + +Stores dataset metadata including source, format, licensing, and upload information. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| did | int unsigned | No | auto_increment | | Primary key (dataset ID). | 42 | +| uploader | mediumint unsigned | Yes | NULL | [openml.users.id](openml.md#users) | User who uploaded the dataset. | 2 | +| source | int unsigned | Yes | NULL | | ID of the original source dataset if derived. Used only rarely. | 3 | +| name | varchar(128) | No | | | Dataset name. 
| iris | +| version | varchar(64) | No | | | Version string, displayed on the dataset page. | 3 | +| version_label | varchar(128) | Yes | NULL | | Human-readable label. | tabular-ml-iid-study-0.0.1 | +| format | varchar(64) | No | 'arff' | | File format, one of: ARFF, Sparse_ARFF, CSV, "CSV, MAT", txt, Rimage | ARFF | +| creator | text | Yes | NULL | | Original creator(s) of the dataset. | "Jason", "G. Davies, A. Horne" | +| contributor | text | Yes | NULL | | People who contributed to the current version (e.g., formatting). | "A. Dent", "The Data Institute" | +| collection_date | varchar(128) | Yes | NULL | | When the data was originally collected, in any format. | "1980", "2024-10-30", "2022, 5 May" | +| upload_date | datetime | No | | | When the dataset was uploaded to OpenML. | 2014-04-06 23:19:20 | +| language | varchar(128) | Yes | NULL | | Language of the dataset content. | "En", "Dutch" | +| licence | varchar(64) | No | 'Public' | | License under which the dataset is shared. | Public | +| citation | text | Yes | NULL | | How to cite this dataset, sometimes a link to the policy. | Bareiss, E. Ray, & \[...\] Proceedings of \[...\] | +| collection | varchar(64) | Yes | NULL | | Collection the dataset belongs to. Not used. | NULL | +| url | mediumtext | No | | | URL to download the dataset file. Most often links to OpenML, but not always. | https://openml.org/data/download/22126628/test.arff | +| parquet_url | mediumtext | Yes | NULL | | *NOT IN PRODUCTION* URL to Parquet version of the dataset. | | +| isOriginal | enum('true','false') | Yes | NULL | | Whether this is an original dataset (not derived). | true | +| file_id | int | Yes | NULL | [openml.file.id](openml.md#file) | Reference to the uploaded file. | 60 | +| default_target_attribute | varchar(1024) | Yes | NULL | | Name of the default target column. Allows csv. | class | +| row_id_attribute | varchar(128) | Yes | NULL | | Name of the row identifier column. Allows csv. 
| id | +| ignore_attribute | varchar(128) | Yes | NULL | | Columns to ignore during modeling. Allows csv. | "animal", "'url_hash', 'query_id'" | +| paper_url | mediumtext | Yes | NULL | | URL to an associated publication. | https://arxiv.org/html/2402.55618v3 | +| visibility | varchar(128) | No | 'public' | | Visibility level (public, private, or friends). Note that non-public visibility is currently not supported. | public | +| original_data_id | int | Yes | NULL | | ID of the original dataset this was derived from. | 23 | +| original_data_url | mediumtext | Yes | NULL | | URL to the original data source. | https://zenodo.org/record/322475/files/bike.arff | +| update_comment | text | Yes | NULL | | Comment explaining the latest update. | fixed features | +| last_update | datetime | Yes | NULL | | Timestamp of the last update. | 2017-10-28 23:42:18 | + + +### dataset_description + +Stores versioned descriptions for datasets. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| did | int unsigned | No | | [dataset.did](#dataset) | Dataset this description belongs to. | 42 | +| version | int unsigned | No | | | Description version number. | 2 | +| description | text | No | | | The dataset description text. | This dataset describes... | +| uploader | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who wrote this description. | 3 | + +### dataset_status + +Tracks the current status of the dataset. +The absence of an entry in this table for a dataset denotes that the dataset is "in preparation". +Rows are removed from this table to indicate transitions (deactivated -> active)! +It does *not* provide a historical record of the dataset status. 
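The rule that a missing row means "in preparation" can be sketched as a small lookup. This is only an illustration: it uses an in-memory SQLite table as a stand-in for the MySQL `dataset_status` table, the function name `dataset_status` is hypothetical, and it assumes that if more than one row exists for a dataset, the most recent `status_date` wins.

```python
import sqlite3

# In-memory stand-in for the MySQL `dataset_status` table (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset_status ("
    "did INTEGER NOT NULL, status TEXT NOT NULL, "
    "status_date TEXT NOT NULL, user_id INTEGER NOT NULL)"
)
conn.execute(
    "INSERT INTO dataset_status VALUES (42, 'active', '2022-04-10 11:10:42', 1)"
)


def dataset_status(connection: sqlite3.Connection, did: int) -> str:
    """Return the current status; no row at all means 'in preparation'."""
    row = connection.execute(
        # Assumption: the most recent row is the current status.
        "SELECT status FROM dataset_status WHERE did = ? "
        "ORDER BY status_date DESC LIMIT 1",
        (did,),
    ).fetchone()
    return row[0] if row is not None else "in preparation"


print(dataset_status(conn, 42))  # dataset with a status row
print(dataset_status(conn, 7))   # dataset without any row
```

Because the table holds no history, a query like this can only report the present state, not when earlier states began or ended.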
+ +Allowed transitions are (as defined by the PHP implementation): + + - in preparation -> active + - in preparation -> deactivated + - deactivated -> active + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| did | int unsigned | No | | [dataset.did](#dataset) | Dataset ID. | 42 | +| status | enum('active','deactivated') | No | | | Status value. | active | +| status_date | datetime | No | | | When the status was set. | 2022-04-10 11:10:42 | +| user_id | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who changed the status. | 1 | + +### dataset_tag + +User-assigned tags on datasets for categorization and search. +Note that historically, collections used tags (e.g., `study_14` indicates the dataset is linked to study 14). + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int unsigned | No | | [dataset.did](#dataset) | Dataset ID. | 3 | +| tag | varchar(255) | No | | | Tag string. | Medicine, study_14 | +| uploader | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who added the tag. | 4 | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the tag was added. | 2018-11-04 08:57:17 | + +### dataset_topic + +Assigns topic labels to datasets. +Topics are displayed as tags on the web page. +This is the result of an experiment in 2021 to try to categorize datasets to better facilitate search. +Topics have been added by automated analysis. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int unsigned | No | | [dataset.did](#dataset) | Dataset ID. | 5 | +| topic | varchar(255) | No | | | Topic label. 
| Health | +| uploader | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who assigned the topic. | 8111 | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the topic was assigned. | 2021-06-01 12:32:45 | + +--- + +## Data Features and Qualities + +### data_feature + +Stores metadata and statistics for each feature (column) of a dataset, as computed by an evaluation engine. + +!!! bug "What's going on?" + + We need more documentation for the computed columns `NumberOf*` and `ClassDistribution`. + From just the database values, it's unclear what's going on. + For example, 'nominal' features have their `NumberOfIntegerValues` column populated. + + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| did | int unsigned | No | | [dataset.did](#dataset) | Dataset ID. | 2 | +| index | int unsigned | No | | | Feature index (column position), 0-indexed. | 0 | +| evaluation_engine_id | int | No | | [evaluation_engine.id](#evaluation_engine) | Engine that computed the feature metadata. | 1 | +| name | varchar(64) | No | | | Feature name. | petal-width | +| data_type | varchar(64) | Yes | NULL | | Data type (numeric, nominal, string, date, unknown). | numeric | +| is_target | enum('true','false') | No | 'false' | | Whether this is the default target feature. | true | +| is_row_identifier | enum('true','false') | No | 'false' | | Whether this feature is a row ID. | false | +| is_ignore | enum('true','false') | No | 'false' | | Whether this feature should be ignored. | false | +| NumberOfDistinctValues | int | Yes | NULL | | Number of distinct values. | 14972 | +| NumberOfUniqueValues | int | Yes | NULL | | Number of values that appear exactly once. | 0 | +| NumberOfMissingValues | int | No | | | Count of missing values. | 124 | +| NumberOfIntegerValues | int | Yes | NULL | | Count of integer values. 
| | +| NumberOfRealValues | int | Yes | NULL | | Count of real-valued entries. | | +| NumberOfNominalValues | varchar(512) | No | | | Number of nominal categories (or distribution). | | +| NumberOfValues | int | No | | | Total number of values. | | +| MaximumValue | int | Yes | NULL | | Maximum value (for numeric features). | | +| MinimumValue | int | Yes | NULL | | Minimum value (for numeric features). | | +| MeanValue | int | Yes | NULL | | Mean value (for numeric features). | | +| StandardDeviation | int | Yes | NULL | | Standard deviation (for numeric features). | | +| ClassDistribution | text | Yes | NULL | | Only for nominal features, otherwise '[]'. See below. | [["?","T"],[[7, 99, 586, 0, 67, 2],[1, 0, 98, 0, 0, 38]]] | + +`ClassDistribution` is `[]` for all features except nominal ones, and only populated if the dataset has a target feature that is also nominal (note that a dataset can have multiple target features). +For nominal features, it is a U x C matrix where U is the number of unique values in the feature and C is the number of classes of the target feature of the dataset. +Each cell specifies how often the feature value occurs for the specific target value. +The behavior with multiple target features that are nominal is unspecified. + +### data_feature_description + +User-provided descriptions and ontology annotations for individual data features. +The table is empty in production (as of 2026-04-29) and should be considered experimental. +This feature was added recently (around 2025). + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| did | int unsigned | No | | [data_feature(did, index)](#data_feature) | Dataset ID. | 2 | +| index | int unsigned | No | | [data_feature(did, index)](#data_feature) | Feature index (column position, 0-indexed). 
| 0 | +| uploader | mediumint unsigned | No | | | User who added the description. | 1548 | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the description was added. | 2025-12-12 09:29:30 | +| description_type | enum('plain','ontology') | No | | | Type of description. | 'ontology' | +| value | varchar(256) | No | | | The description or ontology URI. | http://xmlns.com/foaf/0.1/#name | + +### data_feature_value + +Enumerates the distinct values of a feature (only for nominal features). + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| did | int unsigned | No | | [data_feature(did, index)](#data_feature) | Dataset ID. | 2 | +| index | int unsigned | No | | [data_feature(did, index)](#data_feature) | Feature index. | 3 | +| value | varchar(256) | No | | | A distinct value of the feature. | 'tulip' | + +### data_quality + +Stores computed quality measures (meta-features) for datasets, such as number of instances or entropy. +Measures are computed by an evaluation engine. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| data | int unsigned | No | 0 | [dataset.did](#dataset) | Dataset ID. | 2 | +| quality | varchar(128) | No | | [quality.name](#quality) | Name of the quality measure. | 'AutoCorrelation' | +| evaluation_engine_id | int | No | | [evaluation_engine.id](#evaluation_engine) | Engine that computed the quality. | 1 | +| value | varchar(128) | Yes | NULL | | Computed value. | 0.5 | +| description | text | Yes | NULL | | Additional description or notes. | | + + +### data_processed + +Tracks whether a dataset has been processed by an evaluation engine (feature extraction, quality computation). +Note that rows in this table are edited in place: retrying a failed processing attempt increments `num_tries` rather than adding a new row. 
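As a rough illustration of the in-place behaviour, the sketch below records two processing attempts for the same dataset and engine and ends up with a single row. It is not taken from the OpenML code base: it uses an in-memory SQLite table in place of MySQL, the helper `record_attempt` is hypothetical, and the unique key on (`did`, `evaluation_engine_id`) is an assumption. In MySQL the comparable statement would be `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE data_processed (
        did INTEGER NOT NULL,
        evaluation_engine_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        processing_date TEXT NOT NULL,
        error TEXT,
        warning TEXT,
        num_tries INTEGER NOT NULL DEFAULT 1,
        PRIMARY KEY (did, evaluation_engine_id)  -- assumed key for this sketch
    )
    """
)


def record_attempt(connection, did, engine_id, user_id, when, error=None):
    """Insert the first attempt, or update the existing row on a retry."""
    connection.execute(
        """
        INSERT INTO data_processed
            (did, evaluation_engine_id, user_id, processing_date, error)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (did, evaluation_engine_id) DO UPDATE SET
            processing_date = excluded.processing_date,
            error = excluded.error,
            num_tries = num_tries + 1
        """,
        (did, engine_id, user_id, when, error),
    )


# A failed first attempt followed by a successful retry.
record_attempt(conn, 1, 1, 1, "2020-02-04 12:40:00", error="keyword @relation expected")
record_attempt(conn, 1, 1, 1, "2020-02-04 12:45:01")

rows = conn.execute("SELECT num_tries, error FROM data_processed").fetchall()
print(rows)
```

One consequence of this design is that the error message of an earlier failed attempt is overwritten by the retry, so the table reflects only the latest outcome.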
+ +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| did | int unsigned | No | | [dataset.did](#dataset) | Dataset ID. | 1 | +| evaluation_engine_id | int | No | | [evaluation_engine.id](#evaluation_engine) | Engine that processed the dataset. | 1 | +| user_id | int | No | | | User who triggered processing. | 1 | +| processing_date | datetime | No | | | When processing completed. | 2020-02-04 12:45:01 | +| error | text | Yes | NULL | | Error message if processing failed. | "keyword @relation expected, read Token[Id], line 1" | +| warning | text | Yes | NULL | | Warning messages from processing. Always NULL on prod. | NULL | +| num_tries | int | No | 1 | | Number of processing attempts. | 3 | + + +### quality + +Defines the available dataset quality measures (meta-features) and their properties. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| name | varchar(128) | No | | | Primary key. Name of the quality measure. | AutoCorrelation | +| type | varchar(128) | No | 'DataProperty' | | Category, one of DataQuality, FeatureQuality, AlgorithmQuality. | FeatureQuality | +| formula | text | Yes | NULL | | Formula or computation method. Always NULL except for 1 row. | Calculated using... | +| description | text | Yes | NULL | | Human-readable description. | Average class difference between consecutive instances. | +| datatype | varchar(128) | No | 'undefined' | | Expected data type of the value. One of 'double', 'integer', 'undefined'. | | +| min | float | Yes | NULL | | Minimum possible value. | 0 | +| max | float | Yes | NULL | | Maximum possible value. | 1.0 | +| unit | varchar(32) | Yes | NULL | | Unit of measurement. | NULL | +| priority | int | No | 9999 | | Display priority (lower is higher priority). 
| 2 | +| showonline | enum('true','false') | No | | | Whether to show on the website. | true | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the quality was defined. | 2014-12-31 23:00:00 | + + +--- + +## Tasks + +To more easily understand the way tasks are structured, we break down tasks into three components: + + - the task type: concept of what the task is (e.g., Supervised Classification) and the `task_type_inout` specifies the required and optional inputs and outputs for such a task. For example, a Supervised Classification task must have (among others) a target feature specified. + - the estimation procedure: the experimental setup to evaluate model quality for the task (e.g., 10-fold Cross Validation) + - the `task` together with its `task_inputs` specify concrete values for the task (e.g., 10-fold Cross Validation on Iris predicting Class). + +For example, [task 59](https://www.openml.org/t/59) (10-fold Cross Validation on Iris) has the `task` row in the database (`task_id=59, ttid=1, ...`): + + 1. It is of task type 1, Supervised Classification (from resolving its `ttid` column against the `task_type` table). + 2. When the task was created, the creator had to specify the following input since they are `required` for the task type as dictated by `task_type_inout`: + - source data: to which dataset is the task linked? + - estimation procedure: what is the experimental setup for the runs? + - target feature: what is the target to predict? + 3. It could have optionally also specified the following properties, as indicated by the `task_type_inout` table: cost matrix, custom testset, evaluation measures. + 4. The `task_inputs` table is to specify what the corresponding values are for this task. E.g., the row with input "source\_data" has value "61" specifying that dataset 61 is used. It similarly defines the estimation procedure, target feature, and also the optional input of "evaluation measure". 
 That is to say, `task_inputs` primarily references other tables. + 5. The estimation procedure required by the task type, and given a value in `task_inputs`, must correspond to an entry in the `estimation_procedure` table. E.g., if we find (`estimation_procedure`, `1`) as a task input, we know it is 10-fold Cross Validation (the row with `id=1` in `estimation_procedure`). + 6. `task_type_inout` further specifies which `output` the experiments for the task may contain: evaluation measures, model, predictions. This is not directly used for tasks, but rather for runs. + +So there are constraints from the `task_inputs` table to multiple other tables (`dataset`, `estimation_procedure`, ...) that are not explicitly present in the database. +The `task_inputs` table is also expected to be populated with entries for each task, according to the inputs defined by the `task_type_inout` table for its task type. +The `estimation_procedure_type` table provides general descriptions for procedures (such as Hold out); they are matched by name even though the relationship is not explicit in the database. + +### task_type + +Defines the types of machine learning tasks available (e.g., classification, regression). + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| ttid | int | No | auto_increment | | Primary key (task type ID). | 1 | +| name | varchar(128) | No | | | Task type name. | Supervised Classification | +| description | text | No | | | Description of the task type. | Predict the value of a nominal feature given the other features. | +| creator | varchar(128) | No | | | Who defined this task type. | "Alice Smith, John Hickey" | +| contributors | text | Yes | NULL | | Additional contributors. | "Piet Houten, Betty Boss" | +| creationDate | datetime | No | | | When the task type was created. 
| 2017-09-28 11:23:09 | + +### task_type_inout + +Defines the input and output specifications for each task type. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| ttid | int | No | | [task_type.ttid](#task_type) | Task type ID. | 1 | +| name | varchar(64) | No | | | Name of the input/output parameter. See note. | source_data | +| type | varchar(64) | No | | | Data type of the parameter, from [task_io_types](#task_io_types). | String, "Estimation Procedure" | +| io | enum('input','output') | No | | | Whether this is an input or output. | input | +| requirement | enum('required','optional','hidden') | No | | | Whether this parameter is required. | required | +| description | varchar(256) | No | | | Description of the parameter. | "This input is required to foo the bar." | +| order | int | No | | | Display order used by the frontend. | 29 | +| api_constraints | text | Yes | NULL | | API-level constraints on this parameter as JSON. | See below. | +| template_api | text | Yes | NULL | | Template for API representation. | See below. | +| template_search | text | Yes | NULL | | Template for search representation. | See below. | + + +The `api_constraints`, `template_api` and `template_search` columns contain values that are used by PHP to help format responses. + +The `api_constraints` contains JSON with optionally some special instructions: +```json +{ +"data_type": "string", +"select": "name", +"from": "data_feature", +"where": "did = '[INPUT:source_data]' AND data_type = 'nominal'" +} +``` +Production uses the following directives: + + - `[INPUT:source_data]`: look up the value of `task_inputs.value` where `input="source_data"` and `task_id` matches the task. + - `[TASK:ttid]`: look up the value of `task.ttid` for that task. 
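The two directives above amount to a placeholder substitution. The following is a minimal sketch of that idea, not the PHP implementation: the function `resolve_directives` and the dictionaries standing in for a `task` row and its `task_inputs` rows are hypothetical, and only the `[INPUT:*]` and `[TASK:*]` forms are handled.

```python
import re

# Hypothetical stand-ins for one `task` row and its `task_inputs` rows.
TASK = {"task_id": 59, "ttid": 1}
TASK_INPUTS = {
    "source_data": "61",
    "estimation_procedure": "1",
    "target_feature": "class",
}

# Matches directives such as [INPUT:source_data] or [TASK:ttid].
DIRECTIVE = re.compile(r"\[(INPUT|TASK):([A-Za-z_]+)\]")


def resolve_directives(template: str, task: dict, task_inputs: dict) -> str:
    """Replace each directive with the value it refers to."""

    def substitute(match: re.Match) -> str:
        kind, key = match.group(1), match.group(2)
        if kind == "INPUT":
            # Corresponds to task_inputs.value where input == key.
            return str(task_inputs[key])
        # Corresponds to a column of the task row, e.g. task.ttid.
        return str(task[key])

    return DIRECTIVE.sub(substitute, template)


where = "did = '[INPUT:source_data]' AND data_type = 'nominal'"
resolved = resolve_directives(where, TASK, TASK_INPUTS)
print(resolved)  # did = '61' AND data_type = 'nominal'
```

The sketch builds the clause by plain string substitution only to mirror the template format; code that actually runs such a query would normally bind the resolved values as query parameters instead.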
+ +The `template_api` contains XML instead: +```xml + +[INPUT:estimation_procedure] +[LOOKUP:estimation_procedure.type] +[CONSTANT:base_url]api_splits/get/[TASK:id]/Task_[TASK:id]_splits.arff +[LOOKUP:estimation_procedure.repeats] +[LOOKUP:estimation_procedure.folds] +[LOOKUP:estimation_procedure.percentage] +[LOOKUP:estimation_procedure.stratified_sampling] + +``` +It additionally uses the directives: + + - `[LOOKUP:*]`: look up the `TABLE.column` value associated with the task. + - `[CONSTANT:*]`: look up constants configured in PHP, not in the database. + +Finally, the `template_search` contains JSON again: +```json +{ + "name": "Dataset(s)", + "autocomplete": "commaSeparated", + "datasource": "expdbDatasetVersion()", + "placeholder": "(*) include all datasets" +} +``` + +The `expdbDatasetVersion()` function is no longer used. + +### task_io_types + +Defines the valid types for task inputs and outputs. +Used in the `task_type_inout` table. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| name | varchar(64) | No | | | Primary key. Type name. | String | +| description | text | No | | | Description of this I/O type. | A string, possibly contains csv values. | + +### task + +Represents a specific machine learning task. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| task_id | int | No | auto_increment | | Primary key. | 59 | +| ttid | int | No | | [task_type.ttid](#task_type) | Task type. | 1 | +| creator | mediumint unsigned | Yes | NULL | [openml.users.id](openml.md#users) | User who created the task. | 1 | +| creation_date | datetime | Yes | NULL | | When the task was created. | 2014-11-02 03:12:15 | +| embargo_end_date | datetime | Yes | NULL | | End date of any data embargo. 
| 2025-10-30 23:52:28 | + +### task_inputs + +Stores the input parameter values for a specific task instance. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| task_id | int | No | | [task.task_id](#task) | Task ID. | 59 | +| input | varchar(64) | No | | | Input parameter name. | source_data | +| value | text | No | | | Input parameter value. | 61 | + +Valid `input` values depend on the `task_type_inout` for the `task_type` that's specified in the `task`. + +### task_tag + +User-assigned tags on tasks. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int | No | | [task.task_id](#task) | Task ID. | 59 | +| tag | varchar(255) | No | | | Tag string. | uci | +| uploader | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who added the tag. | 2 | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the tag was added. | 2022-10-09 17:18:19 | + +### estimation_procedure_type + +Defines the types of estimation procedures (e.g., cross-validation, holdout). +Variations are defined in the `estimation_procedure` table. +E.g., this table describes "cross validation" and `estimation_procedure` describes `10-fold cross validation`, `5-fold cross validation`, and so on. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| name | varchar(64) | No | | | Primary key. Procedure type name. | crossvalidation | +| description | text | No | | | Description of the procedure type. | a process where the dataset is divided into folds, ... | + +### estimation_procedure + +Defines specific estimation procedure configurations used in tasks. 
+ +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int | No | auto_increment | | Primary key. | 1 | +| ttid | int | No | | [task_type.ttid](#task_type) | Task type this procedure applies to. | 1 | +| name | varchar(128) | No | | | Procedure name. | 10-fold cross validation | +| type | enum('crossvalidation','leaveoneout','holdout','holdout_ordered','bootstrapping','subsampling','testthentrain','holdoutunlabeled','customholdout','testontrainingdata') | No | | | Procedure type. | crossvalidation | +| repeats | int | Yes | NULL | | Number of repetitions. | 1 | +| folds | int | Yes | NULL | | Number of folds. | 10 | +| samples | enum('false','true') | No | 'false' | | Whether learning curve samples are used. | false | +| percentage | int | Yes | NULL | | Train/test split percentage. | NULL | +| stratified_sampling | enum('true','false') | Yes | NULL | | Whether stratified sampling is used. | true | +| custom_testset | enum('true','false') | No | 'false' | | Whether a custom test set is provided. | false | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the procedure was created. | 2020-02-20 20:02:20 | + +--- + +## Flows (Implementations) +The database uses the historical name 'implementation' for a flow. + +### implementation + +Stores machine learning flows (algorithms/pipelines) that can be executed on tasks. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int | No | auto_increment | | Primary key (flow ID). | 1 | +| fullName | varchar(1024) | No | | | Fully qualified name (name + version). | sklearn.tree.DecisionTreeClassifier(1) | +| uploader | mediumint unsigned | Yes | NULL | [openml.users.id](openml.md#users) | User who uploaded the flow. | 2 | +| name | varchar(1024) | No | | | Flow name. 
| sklearn.tree.DecisionTreeClassifier | +| custom_name | varchar(256) | Yes | NULL | | User-defined display name. | Tree | +| class_name | varchar(256) | Yes | NULL | | Class name in the source library. | nl.liacs.subdisc.SubgroupDiscovery | +| version | int | No | | | Internal version number. | 1 | +| external_version | varchar(128) | No | | | Version string from the source library. | "sklearn=0.12.3,numpy=0.22.1" | +| creator | varchar(128) | Yes | NULL | | Original creator of the algorithm. | Arlo Knoppen | +| contributor | text | Yes | NULL | | Additional contributors. | "Marcel Alder" | +| uploadDate | datetime | No | | | When the flow was uploaded. | 2015-12-21 18:28:38 | +| licence | varchar(64) | Yes | NULL | | License. | public domain | +| language | varchar(128) | Yes | NULL | | Language the description is written in. Sometimes used to specify programming language instead. | English | +| description | text | Yes | NULL | | Short description. | Common Decision Tree algorithm | +| fullDescription | text | Yes | NULL | | Full description. | This algorithm partitions the data... | +| installationNotes | text | Yes | NULL | | Installation instructions. Not really used. | Runs on OpenML | +| dependencies | text | Yes | NULL | | Required dependencies. | mlr\_2.3 | +| implements | varchar(128) | Yes | NULL | | Algorithm or standard this flow implements. Seems to be legacy | build\_cpu\_time | +| binary_file_id | int | Yes | NULL | | File ID of the compiled binary. | 32 | +| source_file_id | int | Yes | NULL | | File ID of the source code. | 33 | +| visibility | varchar(128) | No | 'public' | | Visibility level. One of 'public' or 'private' | public | +| citation | text | Yes | NULL | | How to cite this flow. Only used once. | "A. Boo et al., Journal of ..." | + + + +### implementation_component + +Defines parent-child relationships between flows (e.g., a pipeline containing sub-flows). 
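
Because `implementation_component` stores one `parent`/`child` pair per row, finding every sub-flow of a pipeline means following those pairs repeatedly. The following is a minimal sketch of that traversal, not the server's actual code. The rows and the `descendants` helper are hypothetical and only mimic the shape of the table's `parent`, `child`, and `identifier` columns.

```python
from collections import defaultdict

# Hypothetical (parent, child, identifier) rows shaped like `implementation_component`.
# Flow 1 is a pipeline with two components; flow 3 itself wraps flow 4.
rows = [
    (1, 2, "scaler"),
    (1, 3, "classifier"),
    (3, 4, "base_estimator"),
]


def descendants(rows, flow_id):
    """Return the sorted IDs of all flows nested (directly or indirectly) under flow_id."""
    # Index children by parent so each lookup is a dictionary access.
    children = defaultdict(list)
    for parent, child, _identifier in rows:
        children[parent].append(child)

    found = []
    seen = {flow_id}  # guards against accidental cycles in the data
    stack = [flow_id]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return sorted(found)
```

With the sample rows, `descendants(rows, 1)` collects flows 2, 3, and 4, while a leaf flow such as 4 yields an empty list.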
+ +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| parent | int | No | | [implementation.id](#implementation) | Parent flow ID. | 1 | +| child | int | No | | [implementation.id](#implementation) | Child (component) flow ID. | 2 | +| identifier | varchar(1024) | Yes | NULL | | Role or name of the component within the parent. | PCA, scaler | + +### implementation_tag + +User-assigned tags on flows. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int | No | | [implementation.id](#implementation) | Flow ID. | 2 | +| tag | varchar(255) | No | | | Tag string. | Verified\_Supervised\_Classification | +| uploader | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who added the tag. | 2 | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the tag was added. | 2020-06-08 09:20:28 | + +### input + +Defines the hyperparameters (input parameters) of a flow. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int | No | auto_increment | | Primary key. | 2 | +| implementation_id | int | No | | [implementation.id](#implementation) | Flow this parameter belongs to. | 1 | +| name | varchar(512) | Yes | NULL | | Parameter name. | C | +| description | text | Yes | NULL | | Parameter description. | Regularization parameter. | +| dataType | varchar(255) | Yes | NULL | | Expected data type. | float | +| defaultValue | text | Yes | NULL | | Default value. | 1.0 | +| recommendedRange | text | Yes | NULL | | Recommended value range. | 10e-3 to 10e3 | + +--- + +## Setups + +Setups are really part of a run, they specifically describe the algorithm (flow) configuration used in the run. +They are not a separate entity. 
Each row in the `run` table refers to a setup. A setup may be shared by multiple runs. + + +### algorithm_setup + +Represents a specific configuration (hyperparameter setting) of a flow, including nested component setups. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| sid | int unsigned | No | auto_increment | | Primary key (setup ID). | 1 | +| parent | int unsigned | No | | | Parent setup ID (for nested components). | NULL | +| implementation_id | int | No | | [implementation.id](#implementation) | Flow this setup configures. | 3 | +| algorithm | varchar(128) | Yes | NULL | | Algorithm name. Always NULL in prod! | NULL | +| role | varchar(64) | No | 'Learner' | | Role of this component (e.g., Learner, Preprocessor). | learner | +| isDefault | enum('true','false') | Yes | 'false' | | Whether this is the default setup. | true | +| algorithm_structure | varchar(64) | Yes | NULL | | Not sure. Always NULL in prod. | NULL | +| setup_string | text | Yes | NULL | | Serialized setup string. Can aid reproducing the run. | "weka.classifiers.trees.J48 -- -C 0.25 -M 2" | + +### input_setting + +Stores the actual hyperparameter values for a specific setup. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| setup | int unsigned | No | 0 | [algorithm_setup.sid](#algorithm_setup) | Setup ID. | 1 | +| input | varchar(128) | No | | | Parameter name. Only old setups have this. | 428_I | +| input_id | int | No | | [input.id](#input) | Reference to the parameter definition. | 4124 | +| value | varchar(2048) | No | | | Parameter value. | 1000, gini | + +### setup_tag + +User-assigned tags on setups. +Setup tags cannot really be assigned by users, only by admins. Runs should be tagged by users. 
+ +These tags are only used in the filtering of `/setups/list`, they are not returned with a `setup` or `run`. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int unsigned | No | | [algorithm_setup.sid](#algorithm_setup) | Setup ID. | 42 | +| tag | varchar(255) | No | | | Tag string. | test_setup | +| uploader | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who added the tag. | 2 | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the tag was added. | 2024-12-01 23:04:57 | + +### setup_differences + +Precomputed pairwise differences between setups on specific tasks. +Not entirely sure what this means, but seems to be out of use (last setup included is from a run in 2019). + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| sidA | int | No | | | First setup ID. | 2 | +| sidB | int | No | | | Second setup ID. | 3 | +| task_id | int | No | | | Task ID. | 1245 | +| task_size | int | No | | | Size of the task dataset. | 100000 | +| differences | int | No | | | ??? | 812987 | + +--- + +## Runs and Evaluations + +### run + +Represents the execution of a flow setup on a task. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| rid | int unsigned | No | auto_increment | | Primary key (run ID). | 2 | +| uploader | mediumint unsigned | Yes | NULL | [openml.users.id](openml.md#users) | User who uploaded the run. | 16 | +| setup | int unsigned | No | 0 | [algorithm_setup.sid](#algorithm_setup) | Setup (hyperparameter configuration) used. | 2894 | +| task_id | int | Yes | NULL | [task.task_id](#task) | Task the run was executed on. | 284 | +| start_time | datetime | Yes | NULL | | When the run started. 
| 2018-07-31 10:57:42 | +| error_message | text | Yes | NULL | | Error message if the run failed. | "weka.classifiers.bayes.HNB: Cannot handle numeric attributes!" | +| run_details | text | Yes | NULL | | Additional run details or logs. Not really used. | "Custom:Hyperparameter" | +| visibility | varchar(128) | No | 'public' | | Visibility level, 'public' or 'private'. | public | + +### run_evaluated + +Tracks whether a run has been evaluated by an evaluation engine. +Currently only one evaluation engine is evaluating runs. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| run_id | int unsigned | No | | [run.rid](#run) | Run ID. | 24 | +| evaluation_engine_id | int | No | | [evaluation_engine.id](#evaluation_engine) | Engine that evaluated the run. | 1 | +| user_id | int | No | | | User who triggered evaluation. | 381 | +| evaluation_date | datetime | No | | | When evaluation completed. | 2024-02-04 08:09:27 | +| error | text | Yes | NULL | | Error message if evaluation failed. | "Index 1 out of bounds for length 1" | +| warning | text | Yes | NULL | | Warning messages from evaluation. | "Inconsistent Evaluation score: ..." | +| num_tries | int | No | 1 | | Number of evaluation attempts. | 1 | + +### runfile + +Stores metadata about files associated with a run (e.g., predictions, model files). + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| source | int unsigned | No | | [run.rid](#run) | Run ID. | 28 | +| field | varchar(128) | No | | | File field name (e.g., predictions, model serialized). | predictions | +| name | varchar(128) | No | | | Original filename. | weka_generated_run5258986433356798974.xml | +| format | varchar(128) | No | | | File format. | xml | +| upload_time | datetime | Yes | NULL | | When the file was uploaded. 
| 2017-05-29 19:28:56 | +| file_id | int | No | | [openml.file.id](openml.md#file) | Reference to the file record. | 152 | + +### run_tag + +User-assigned tags on runs. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int unsigned | No | | [run.rid](#run) | Run ID. | 1521 | +| tag | varchar(255) | No | | | Tag string. | micro_benchmark | +| uploader | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who added the tag. | 124 | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the tag was added. | 2024-10-24 22:58:18 | + +### input_data + +Records the input datasets used by a run. +Used by PHP, but arguably they could use the relationship run->task->task_inputs->"source_data" instead. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| run | int | No | | | Run ID. | 21 | +| data | int | No | | | Dataset ID. | 42 | +| name | varchar(128) | No | 'inputdata' | | Always 'dataset' | dataset | + +### evaluation_engine + +Defines the evaluation engines that compute metrics on runs and datasets. +Currently only has one row, as we only ever used one engine in production. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int | No | auto_increment | | Primary key. | 1 | +| name | varchar(256) | No | | | Engine name. | weka_engine | +| description | text | No | | | Engine description. | "Default OpenML evaluation engine" | + +### math_function + +Defines evaluation metrics (e.g., accuracy, AUC, RMSE) and their properties. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int | No | auto_increment | | Primary key. 
| 1 | +| name | varchar(64) | No | | | Metric name (unique). | EuclidianDistance | +| functionType | varchar(128) | No | 'EvaluationFunction' | | Type of function. One of Metric, KernelFunction, or EvaluationFunction | Metric | +| min | varchar(64) | No | | | Minimum possible value. | 0 | +| max | varchar(64) | No | | | Maximum possible value. | '' | +| unit | varchar(64) | No | | | Unit of measurement. | seconds, bytes | +| higherIsBetter | varchar(16) | Yes | NULL | | Whether higher values indicate better performance. | true, Yes, 1, False | +| description | text | Yes | NULL | | Description of the metric. | "The area under the ROC..." | +| source_code | text | No | | | Source code or formula. | "public double truePositiveRate(..." | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the function was defined. | 2024-04-19 23:51:52 | + +### evaluation + +Stores aggregated evaluation results for a run (one value per metric). + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| source | int unsigned | No | | [run.rid](#run) | Run ID. | 1 | +| function_id | int | No | | [math_function.id](#math_function) | Evaluation metric. | 4 | +| evaluation_engine_id | int | No | | [evaluation_engine.id](#evaluation_engine) | Engine that computed the evaluation. | 1 | +| value | double | Yes | NULL | | Aggregated metric value. | 0.839 | +| stdev | double | Yes | NULL | | Standard deviation across folds/repeats. | 0.06 | +| array_data | text | Yes | NULL | | Per-class or detailed results as array. | "[0.0,0.99113,0.898048,0.874862,0.791282,0.807343,0.820674]" | + +### evaluation_fold + +Stores per-fold evaluation results for a run. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| source | int unsigned | No | | [run.rid](#run) | Run ID. 
| 12 | +| function_id | int | No | | [math_function.id](#math_function) | Evaluation metric. | 4 | +| evaluation_engine_id | int | No | | [evaluation_engine.id](#evaluation_engine) | Engine that computed the evaluation. | 1 | +| fold | int unsigned | No | 0 | | Fold number. | 0 | +| repeat | int unsigned | No | 0 | | Repeat number. | 0 | +| value | double | Yes | NULL | | Metric value for this fold/repeat. | 0.5 | +| array_data | text | Yes | NULL | | Per-class or detailed results as array. | "[0.4, 0.6]" | + +### evaluation_sample + +Stores per-sample evaluation results (for learning curves). +Evaluation samples seem to be part of a run upload, but might not be able to be retrieved with the current PHP API. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| source | int unsigned | No | | [run.rid](#run) | Run ID. | 12 | +| function_id | int | No | | [math_function.id](#math_function) | Evaluation metric. | 4 | +| evaluation_engine_id | int | No | | [evaluation_engine.id](#evaluation_engine) | Engine that computed the evaluation. | 1 | +| repeat | int unsigned | No | 0 | | Repeat number. | 0 | +| fold | int unsigned | No | 0 | | Fold number. | 0 | +| sample | int unsigned | No | 0 | | Sample index. | 0 | +| sample_size | int | No | | | Number of training instances in this sample. | 100 | +| value | double | Yes | NULL | | Metric value for this sample. | 0.5 | +| array_data | text | Yes | NULL | | Per-class or detailed results as array. | "[0.4,0.6]" | + +### trace + +Stores optimization traces (e.g., hyperparameter search iterations) for a run. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| run_id | int unsigned | No | | [run.rid](#run) | Run ID. 
| 2 | +| evaluation_engine_id | int | No | 1 | [evaluation_engine.id](#evaluation_engine) | Engine that evaluated the trace. | 1 | +| repeat | int | No | | | Repeat number. | 0 | +| fold | int | No | | | Fold number. | 0 | +| iteration | int | No | | | Iteration number in the optimization. | 0 | +| setup_string | text | No | | | Hyperparameter configuration tried. | '{"parameter_minNumObj":"1","parameter_confidenceFactor":"0.1"}' | +| evaluation | varchar(265) | No | | | Evaluation result for this iteration. | 94.12 | +| selected | enum('true','false') | No | | | Whether this was the selected configuration. | true | + +--- + +## Studies + +Studies have historically been in flux, but they are generally collections of objects (e.g., tasks). +One major change during OpenML's lifetime was in how those collections were defined, which happened around 2019. +Nowadays there are dedicated tables (e.g., study_task) to make the connection between a study and its tasks. +Historically, in what is now referred to as a "legacy study", this association was achieved through tags. +E.g., all tasks with tag `study_14` would be considered part of `study_14`. +Some studies are still defined this way, and migration of these studies to new-style studies is planned. +The Python-based REST API will not support legacy-style studies (hence the data migration needs to occur to make the legacy studies usable with the new API). + +The current use is that "benchmark suite" refers to a collection of tasks and a "benchmark study" refers to a collection of runs. +Both are found in the "study" table. It is likely that we will drop this distinction in the future in favor of a more general "collection". + +### study + +Represents a collection of runs or tasks, used for benchmarking and reproducibility. 
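
The difference between legacy and new-style studies described in the Studies introduction can be sketched as two lookups: legacy studies find their tasks through a `study_<id>` tag in `task_tag`, new-style studies through the `task_study` join table. The snippet below is an illustrative sketch only. It assumes an in-memory SQLite database as a stand-in for MySQL, creates just the columns it needs, and uses made-up rows. `tasks_in_study` is a hypothetical helper, not part of either API.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL `openml_expdb` database (assumption).
# Only the columns needed for the illustration exist; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE study (id INTEGER PRIMARY KEY, main_entity_type TEXT, legacy TEXT);
    CREATE TABLE task_study (study_id INTEGER, task_id INTEGER);
    CREATE TABLE task_tag (id INTEGER, tag TEXT);
    INSERT INTO study VALUES (14, 'task', 'y'), (99, 'task', 'n');
    INSERT INTO task_tag VALUES (3, 'study_14'), (6, 'study_14'), (6, 'uci');
    INSERT INTO task_study VALUES (99, 3), (99, 11);
    """
)


def tasks_in_study(conn, study_id):
    """Return the task IDs of a study, for both legacy and new-style studies."""
    row = conn.execute(
        "SELECT legacy FROM study WHERE id = ?", (study_id,)
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown study {study_id}")
    if row[0] == "y":
        # Legacy study: membership is encoded as the tag `study_<id>` on the task.
        query = "SELECT id FROM task_tag WHERE tag = ? ORDER BY id"
        params = (f"study_{study_id}",)
    else:
        # New-style study: membership is a row in the `task_study` join table.
        query = "SELECT task_id FROM task_study WHERE study_id = ? ORDER BY task_id"
        params = (study_id,)
    return [r[0] for r in conn.execute(query, params)]
```

Branching on the `legacy` column keeps both lookups behind one function, which is roughly what the planned data migration would make unnecessary.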
+ +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| id | int | No | auto_increment | | Primary key (study ID). | 2 | +| alias | varchar(32) | Yes | NULL | | Short unique alias for the study. | automl_benchmark | +| main_entity_type | enum('run','task') | No | 'run' | | Whether the study collects runs or tasks. | task | +| benchmark_suite | int | Yes | NULL | [study.id](#study) | Reference to a task study used as benchmark suite. | 2 | +| name | varchar(256) | No | | | Study name. | "A friendly benchmarking suite for AutoML systems" | +| description | text | No | | | Study description. | "See also JMLR..." | +| visibility | varchar(64) | No | 'public' | | Visibility level. Only ever 'public' | public | +| status | enum('in_preparation','active','deactivated') | No | 'in_preparation' | | Current status. | active | +| creation_date | datetime | No | | | When the study was created. | 2021-02-04 20:48:58 | +| creator | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who created the study. | 3 | +| legacy | enum('y','n') | No | 'y' | | Whether this is a legacy study. | y | + + +### study_tag + +Tags applied to studies, with optional time windows and access control. +It is unclear why this table has a different design than the other tag tables, but it likely has to do with the "legacy" study setup. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| study_id | int | No | | [study.id](#study) | Study ID. | 28 | +| tag | varchar(255) | No | | | Tag string. | published, jmlr | +| window_start | datetime | Yes | NULL | | Start of the tag's validity window. | 2020-01-01 12:23:42 | +| window_end | datetime | Yes | NULL | | End of the tag's validity window. 
| 2021-04-11 19:54:24 | +| write_access | enum('private','public') | No | 'private' | | Who can add this tag. | private | + +### run_study + +A join table for the "new" style studies. +Associates runs with studies (many-to-many). + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| study_id | int | No | | [study.id](#study) | Study ID. | 2 | +| run_id | int unsigned | No | | [run.rid](#run) | Run ID. | 102984 | +| uploader | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who added the run to the study. | 1815 | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the run was added. | 2024-05-12 23:10:54 | + +### task_study + +A join table for the "new" style studies. +Associates tasks with studies (many-to-many). + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| study_id | int | No | | [study.id](#study) | Study ID. | 5 | +| task_id | int | No | | [task.task_id](#task) | Task ID. | 18257 | +| uploader | mediumint unsigned | No | | [openml.users.id](openml.md#users) | User who added the task to the study. | 12 | +| date | timestamp | No | CURRENT_TIMESTAMP | | When the task was added. | 2022-04-28 08:28:49 | + +--- + +## Community + +The `awarded_badges`, `likes`, `downvotes` and `downvote_reasons` tables are not currently in use. +They may be added back with the new frontend (except for `awarded_badges`). + +### likes + +Records user likes on datasets, flows, runs, or other entities. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| lid | int | No | auto_increment | | Unique ID. | 1 | +| knowledge_type | varchar(1) | No | | | Entity type code (e.g., 'd' for dataset). | d, f, r, t| +| knowledge_id | int | No | | | Entity ID. 
| 20248 | +| user_id | mediumint | No | | | User who liked the entity. | 241 | +| time | timestamp | No | CURRENT_TIMESTAMP | | When the like was recorded. | 2019-09-08 17:47:57 | + +### downvotes + +Records user downvotes on entities with a reason. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| did | int | No | auto_increment | | Unique ID. | | +| knowledge_type | varchar(1) | No | | | Entity type code. | | +| knowledge_id | int | No | | | Entity ID. | | +| user_id | mediumint | No | | | User who downvoted. | | +| reason | int | No | | | Reason for the downvote (references downvote_reasons). | | +| time | timestamp | No | CURRENT_TIMESTAMP | | When the downvote was recorded. | | +| original | tinyint | No | 0 | | Whether this is the original downvote (vs. an update). | | + +### downvote_reasons + +Defines the reasons for downvoting an entity. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| reason_id | int | No | auto_increment | | Primary key. | | +| description | varchar(256) | No | | | Description of the reason. | | + +### downloads + +Tracks download counts per user and entity. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| did | int | No | auto_increment | | Unique ID. | | +| knowledge_type | varchar(1) | No | | | Entity type code. | | +| knowledge_id | int | No | | | Entity ID. | | +| user_id | mediumint | No | | | User who downloaded. | | +| count | smallint | No | 1 | | Number of downloads. | | +| time | timestamp | No | CURRENT_TIMESTAMP | | Last download time. | | + + +--- + +## Miscellaneous + +The `notebook` and `pdnresults` tables are no longer in use. + +### schedule + +Defines scheduled experiment jobs to be executed. 
+Seems to be out of use since 2017. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| sid | int | No | | | Setup ID. | | +| task_id | int | No | | | Task ID to run. | | +| experiment | varchar(128) | No | | | Experiment identifier. | | +| active | enum('true','false') | No | 'true' | | Whether the schedule is active. | | +| last_assigned | datetime | Yes | NULL | | When this job was last assigned to a worker. | | +| ttid | int | No | | | Task type ID. | | +| dependencies | varchar(128) | No | | | Dependency identifiers. | | +| setup_string | text | No | | | Serialized setup configuration. | | + +### kaggle + +Maps OpenML datasets to Kaggle competitions or datasets. +Defined by hand in collaboration with Kaggle. +Only used by the frontend. + +| Column | Type | Optional | Default | References | Description | Example | +|--------|------|----------|---------|------------|-------------|---------| +| dataset_id | int | Yes | NULL | | OpenML dataset ID. | 6 | +| kaggle_link | varchar(500) | Yes | NULL | | URL to the Kaggle page. | https://www.kaggle.com/datasets/nishan192/letterrecognition-using-svm | diff --git a/mkdocs.yml b/mkdocs.yml index 7e9df195..1acd2313 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -19,6 +19,10 @@ nav: - index.md - Getting Started: installation.md - Changes: migration.md + - Database: + - database/index.md + - openml: database/openml.md + - openml_expdb: database/openml_expdb.md - Contributing: - contributing/index.md - Setup: contributing/setup.md