From c127a508a9f5f5c3f5e7018e7b016db2a33c00bd Mon Sep 17 00:00:00 2001 From: Thomas Rueckstiess Date: Mon, 3 Feb 2025 14:36:22 +1100 Subject: [PATCH 1/9] folder structure, 3 subdirs, splitting README. --- experiments/EXPERIMENTS.md | 27 +++++++ .../{prediction => json2vec}/.env.local | 0 .../{prediction => json2vec}/README.md | 71 +------------------ .../{prediction => json2vec}/__init__.py | 0 .../datasets/Internet-Advertisements.yml | 0 .../datasets/adult.yml | 0 .../datasets/bank-marketing.yml | 0 .../{prediction => json2vec}/datasets/car.yml | 0 .../{prediction => json2vec}/datasets/cmc.yml | 0 .../datasets/connect-4.yml | 0 .../datasets/cylinder-bands.yml | 0 .../datasets/ddxplus-json-v1.yml | 0 .../datasets/ddxplus-json-v2.yml | 0 .../datasets/ddxplus-raw.yml | 0 .../{prediction => json2vec}/datasets/dna.yml | 0 .../datasets/dresses-sales.yml | 0 .../datasets/dungeons-mk.yml | 0 .../datasets/dungeons-rkm.yml | 0 .../datasets/electricity.yml | 0 .../datasets/json2vec-automobile.yml | 0 .../datasets/json2vec-bank.yml | 0 .../datasets/json2vec-car.yml | 0 .../datasets/json2vec-contraceptive.yml | 0 .../datasets/json2vec-mushroom.yml | 0 .../datasets/json2vec-nursery.yml | 0 .../datasets/json2vec-seismic.yml | 0 .../datasets/json2vec-student.yml | 0 ...jungle_chess_2pcs_raw_endgame_complete.yml | 0 .../datasets/kr-vs-kp.yml | 0 .../datasets/letter.yml | 0 .../datasets/movies.yml | 0 .../datasets/mutagenesis.yml | 0 .../datasets/optdigits.yml | 0 .../datasets/phishing.yml | 0 .../datasets/semeion.yml | 0 .../datasets/sick.yml | 0 .../datasets/splice.yml | 0 .../datasets/tictactoe.yml | 0 .../{prediction => json2vec}/flags.yml | 0 .../{prediction => json2vec}/guild.yml | 0 .../{prediction => json2vec}/run_baseline.py | 0 .../{prediction => json2vec}/run_lightgbm.py | 0 .../{prediction => json2vec}/run_logreg.py | 0 .../{prediction => json2vec}/run_origami.py | 0 .../{prediction => json2vec}/run_rf.py | 0 .../{prediction => json2vec}/run_xgboost.py | 0 
.../{prediction => json2vec}/runner.py | 0 experiments/{prediction => json2vec}/utils.py | 0 48 files changed, 30 insertions(+), 68 deletions(-) create mode 100644 experiments/EXPERIMENTS.md rename experiments/{prediction => json2vec}/.env.local (100%) rename experiments/{prediction => json2vec}/README.md (61%) rename experiments/{prediction => json2vec}/__init__.py (100%) rename experiments/{prediction => json2vec}/datasets/Internet-Advertisements.yml (100%) rename experiments/{prediction => json2vec}/datasets/adult.yml (100%) rename experiments/{prediction => json2vec}/datasets/bank-marketing.yml (100%) rename experiments/{prediction => json2vec}/datasets/car.yml (100%) rename experiments/{prediction => json2vec}/datasets/cmc.yml (100%) rename experiments/{prediction => json2vec}/datasets/connect-4.yml (100%) rename experiments/{prediction => json2vec}/datasets/cylinder-bands.yml (100%) rename experiments/{prediction => json2vec}/datasets/ddxplus-json-v1.yml (100%) rename experiments/{prediction => json2vec}/datasets/ddxplus-json-v2.yml (100%) rename experiments/{prediction => json2vec}/datasets/ddxplus-raw.yml (100%) rename experiments/{prediction => json2vec}/datasets/dna.yml (100%) rename experiments/{prediction => json2vec}/datasets/dresses-sales.yml (100%) rename experiments/{prediction => json2vec}/datasets/dungeons-mk.yml (100%) rename experiments/{prediction => json2vec}/datasets/dungeons-rkm.yml (100%) rename experiments/{prediction => json2vec}/datasets/electricity.yml (100%) rename experiments/{prediction => json2vec}/datasets/json2vec-automobile.yml (100%) rename experiments/{prediction => json2vec}/datasets/json2vec-bank.yml (100%) rename experiments/{prediction => json2vec}/datasets/json2vec-car.yml (100%) rename experiments/{prediction => json2vec}/datasets/json2vec-contraceptive.yml (100%) rename experiments/{prediction => json2vec}/datasets/json2vec-mushroom.yml (100%) rename experiments/{prediction => json2vec}/datasets/json2vec-nursery.yml 
(100%) rename experiments/{prediction => json2vec}/datasets/json2vec-seismic.yml (100%) rename experiments/{prediction => json2vec}/datasets/json2vec-student.yml (100%) rename experiments/{prediction => json2vec}/datasets/jungle_chess_2pcs_raw_endgame_complete.yml (100%) rename experiments/{prediction => json2vec}/datasets/kr-vs-kp.yml (100%) rename experiments/{prediction => json2vec}/datasets/letter.yml (100%) rename experiments/{prediction => json2vec}/datasets/movies.yml (100%) rename experiments/{prediction => json2vec}/datasets/mutagenesis.yml (100%) rename experiments/{prediction => json2vec}/datasets/optdigits.yml (100%) rename experiments/{prediction => json2vec}/datasets/phishing.yml (100%) rename experiments/{prediction => json2vec}/datasets/semeion.yml (100%) rename experiments/{prediction => json2vec}/datasets/sick.yml (100%) rename experiments/{prediction => json2vec}/datasets/splice.yml (100%) rename experiments/{prediction => json2vec}/datasets/tictactoe.yml (100%) rename experiments/{prediction => json2vec}/flags.yml (100%) rename experiments/{prediction => json2vec}/guild.yml (100%) rename experiments/{prediction => json2vec}/run_baseline.py (100%) rename experiments/{prediction => json2vec}/run_lightgbm.py (100%) rename experiments/{prediction => json2vec}/run_logreg.py (100%) rename experiments/{prediction => json2vec}/run_origami.py (100%) rename experiments/{prediction => json2vec}/run_rf.py (100%) rename experiments/{prediction => json2vec}/run_xgboost.py (100%) rename experiments/{prediction => json2vec}/runner.py (100%) rename experiments/{prediction => json2vec}/utils.py (100%) diff --git a/experiments/EXPERIMENTS.md b/experiments/EXPERIMENTS.md new file mode 100644 index 0000000..4646b19 --- /dev/null +++ b/experiments/EXPERIMENTS.md @@ -0,0 +1,27 @@ +# Reproducing the results from our paper + +This directory contains the code and instructions to reproduce the experiments from our paper: +[ORIGAMI: A generative transformer architecture 
for predictions from semi-structured data](https://arxiv.org/abs/2412.17348). + +There are 3 sub-directories, each with its own `README.md` file: + +- [`json2vec`](./json2vec/README.md) contains the experiments from section 3.1, where we compare our model, on standard tabular benchmarks that have been converted to JSON, against various baselines and the json2vec models from [A Framework for End-to-End Learning on Semantic Tree-Structured Data](https://arxiv.org/abs/2002.05707) by William Woof and Ke Chen. +- [`ddxplus`](./ddxplus/README.md) contains the experiments from section 3.2 for a medical diagnosis task on patient information. This experiment demonstrates prediction of multi-token values representing arrays of possible pathologies. +- [`codenet`](./codenet/README.md) contains the experiments from section 3.3 related to a Java code classification task. Here we demonstrate the model's ability to deal with complex and deeply nested JSON objects. + +### Experiment Tracking + +We use the open source library [guild.ai](https://guild.ai) for experiment management and result tracking. + +### Datasets + +We bundled all datasets used in the paper in a [MongoDB dump file](). To reproduce the results, you first +need MongoDB installed on your system (or a remote server). Then, download the dump file, unzip it, and restore it into your MongoDB instance: + +``` +mongorestore dump/ +``` + +This assumes your `mongod` server is running on `localhost` on the default port 27017 and without authentication. If your setup differs, consult the [documentation](https://www.mongodb.com/docs/database-tools/mongorestore/) for `mongorestore` on how to restore the data. + +If your database setup (URI, port, authentication) differs, also make sure to update the [`.env.local`](.env.local) file in each sub-directory accordingly. 
diff --git a/experiments/prediction/.env.local b/experiments/json2vec/.env.local similarity index 100% rename from experiments/prediction/.env.local rename to experiments/json2vec/.env.local diff --git a/experiments/prediction/README.md b/experiments/json2vec/README.md similarity index 61% rename from experiments/prediction/README.md rename to experiments/json2vec/README.md index 1f6445a..6bc2f4b 100644 --- a/experiments/prediction/README.md +++ b/experiments/json2vec/README.md @@ -1,73 +1,8 @@ - - -# Reproducing the results from our paper - -We use the open source library [guild.ai](https://guild.ai) for experiment management and result tracking. - -### Datasets - -We bundled all datasets used in the paper in a convenient [MongoDB dump file](). To reproduce the results, first -you need MongoDB installed on your system (or a remote server). Then, download the dump file, unzip it, and restore it into your MongoDB instance: - -``` -mongorestore dump/ -``` - -This assumes your `mongod` server is running on `localhost` on default port 27017 and without authentication. If your setup varies, consult the [documentation](https://www.mongodb.com/docs/database-tools/mongorestore/) for `mongorestore` on how to restore the data. - -If your database setup (URI, port, authentication) differs, also make sure to update the [`.env.local`](.env.local) file in this directory accordingly. +First, make sure you have restored the datasets from the mongo dump file as described in [../EXPERIMENTS.md](../EXPERIMENTS.md). All commands (see below) must be run from the `json2vec` directory. 
### Hyper-parameter tuning diff --git a/experiments/prediction/__init__.py b/experiments/json2vec/__init__.py similarity index 100% rename from experiments/prediction/__init__.py rename to experiments/json2vec/__init__.py diff --git a/experiments/prediction/datasets/Internet-Advertisements.yml b/experiments/json2vec/datasets/Internet-Advertisements.yml similarity index 100% rename from experiments/prediction/datasets/Internet-Advertisements.yml rename to experiments/json2vec/datasets/Internet-Advertisements.yml diff --git a/experiments/prediction/datasets/adult.yml b/experiments/json2vec/datasets/adult.yml similarity index 100% rename from experiments/prediction/datasets/adult.yml rename to experiments/json2vec/datasets/adult.yml diff --git a/experiments/prediction/datasets/bank-marketing.yml b/experiments/json2vec/datasets/bank-marketing.yml similarity index 100% rename from experiments/prediction/datasets/bank-marketing.yml rename to experiments/json2vec/datasets/bank-marketing.yml diff --git a/experiments/prediction/datasets/car.yml b/experiments/json2vec/datasets/car.yml similarity index 100% rename from experiments/prediction/datasets/car.yml rename to experiments/json2vec/datasets/car.yml diff --git a/experiments/prediction/datasets/cmc.yml b/experiments/json2vec/datasets/cmc.yml similarity index 100% rename from experiments/prediction/datasets/cmc.yml rename to experiments/json2vec/datasets/cmc.yml diff --git a/experiments/prediction/datasets/connect-4.yml b/experiments/json2vec/datasets/connect-4.yml similarity index 100% rename from experiments/prediction/datasets/connect-4.yml rename to experiments/json2vec/datasets/connect-4.yml diff --git a/experiments/prediction/datasets/cylinder-bands.yml b/experiments/json2vec/datasets/cylinder-bands.yml similarity index 100% rename from experiments/prediction/datasets/cylinder-bands.yml rename to experiments/json2vec/datasets/cylinder-bands.yml diff --git a/experiments/prediction/datasets/ddxplus-json-v1.yml 
b/experiments/json2vec/datasets/ddxplus-json-v1.yml similarity index 100% rename from experiments/prediction/datasets/ddxplus-json-v1.yml rename to experiments/json2vec/datasets/ddxplus-json-v1.yml diff --git a/experiments/prediction/datasets/ddxplus-json-v2.yml b/experiments/json2vec/datasets/ddxplus-json-v2.yml similarity index 100% rename from experiments/prediction/datasets/ddxplus-json-v2.yml rename to experiments/json2vec/datasets/ddxplus-json-v2.yml diff --git a/experiments/prediction/datasets/ddxplus-raw.yml b/experiments/json2vec/datasets/ddxplus-raw.yml similarity index 100% rename from experiments/prediction/datasets/ddxplus-raw.yml rename to experiments/json2vec/datasets/ddxplus-raw.yml diff --git a/experiments/prediction/datasets/dna.yml b/experiments/json2vec/datasets/dna.yml similarity index 100% rename from experiments/prediction/datasets/dna.yml rename to experiments/json2vec/datasets/dna.yml diff --git a/experiments/prediction/datasets/dresses-sales.yml b/experiments/json2vec/datasets/dresses-sales.yml similarity index 100% rename from experiments/prediction/datasets/dresses-sales.yml rename to experiments/json2vec/datasets/dresses-sales.yml diff --git a/experiments/prediction/datasets/dungeons-mk.yml b/experiments/json2vec/datasets/dungeons-mk.yml similarity index 100% rename from experiments/prediction/datasets/dungeons-mk.yml rename to experiments/json2vec/datasets/dungeons-mk.yml diff --git a/experiments/prediction/datasets/dungeons-rkm.yml b/experiments/json2vec/datasets/dungeons-rkm.yml similarity index 100% rename from experiments/prediction/datasets/dungeons-rkm.yml rename to experiments/json2vec/datasets/dungeons-rkm.yml diff --git a/experiments/prediction/datasets/electricity.yml b/experiments/json2vec/datasets/electricity.yml similarity index 100% rename from experiments/prediction/datasets/electricity.yml rename to experiments/json2vec/datasets/electricity.yml diff --git a/experiments/prediction/datasets/json2vec-automobile.yml 
b/experiments/json2vec/datasets/json2vec-automobile.yml similarity index 100% rename from experiments/prediction/datasets/json2vec-automobile.yml rename to experiments/json2vec/datasets/json2vec-automobile.yml diff --git a/experiments/prediction/datasets/json2vec-bank.yml b/experiments/json2vec/datasets/json2vec-bank.yml similarity index 100% rename from experiments/prediction/datasets/json2vec-bank.yml rename to experiments/json2vec/datasets/json2vec-bank.yml diff --git a/experiments/prediction/datasets/json2vec-car.yml b/experiments/json2vec/datasets/json2vec-car.yml similarity index 100% rename from experiments/prediction/datasets/json2vec-car.yml rename to experiments/json2vec/datasets/json2vec-car.yml diff --git a/experiments/prediction/datasets/json2vec-contraceptive.yml b/experiments/json2vec/datasets/json2vec-contraceptive.yml similarity index 100% rename from experiments/prediction/datasets/json2vec-contraceptive.yml rename to experiments/json2vec/datasets/json2vec-contraceptive.yml diff --git a/experiments/prediction/datasets/json2vec-mushroom.yml b/experiments/json2vec/datasets/json2vec-mushroom.yml similarity index 100% rename from experiments/prediction/datasets/json2vec-mushroom.yml rename to experiments/json2vec/datasets/json2vec-mushroom.yml diff --git a/experiments/prediction/datasets/json2vec-nursery.yml b/experiments/json2vec/datasets/json2vec-nursery.yml similarity index 100% rename from experiments/prediction/datasets/json2vec-nursery.yml rename to experiments/json2vec/datasets/json2vec-nursery.yml diff --git a/experiments/prediction/datasets/json2vec-seismic.yml b/experiments/json2vec/datasets/json2vec-seismic.yml similarity index 100% rename from experiments/prediction/datasets/json2vec-seismic.yml rename to experiments/json2vec/datasets/json2vec-seismic.yml diff --git a/experiments/prediction/datasets/json2vec-student.yml b/experiments/json2vec/datasets/json2vec-student.yml similarity index 100% rename from 
experiments/prediction/datasets/json2vec-student.yml rename to experiments/json2vec/datasets/json2vec-student.yml diff --git a/experiments/prediction/datasets/jungle_chess_2pcs_raw_endgame_complete.yml b/experiments/json2vec/datasets/jungle_chess_2pcs_raw_endgame_complete.yml similarity index 100% rename from experiments/prediction/datasets/jungle_chess_2pcs_raw_endgame_complete.yml rename to experiments/json2vec/datasets/jungle_chess_2pcs_raw_endgame_complete.yml diff --git a/experiments/prediction/datasets/kr-vs-kp.yml b/experiments/json2vec/datasets/kr-vs-kp.yml similarity index 100% rename from experiments/prediction/datasets/kr-vs-kp.yml rename to experiments/json2vec/datasets/kr-vs-kp.yml diff --git a/experiments/prediction/datasets/letter.yml b/experiments/json2vec/datasets/letter.yml similarity index 100% rename from experiments/prediction/datasets/letter.yml rename to experiments/json2vec/datasets/letter.yml diff --git a/experiments/prediction/datasets/movies.yml b/experiments/json2vec/datasets/movies.yml similarity index 100% rename from experiments/prediction/datasets/movies.yml rename to experiments/json2vec/datasets/movies.yml diff --git a/experiments/prediction/datasets/mutagenesis.yml b/experiments/json2vec/datasets/mutagenesis.yml similarity index 100% rename from experiments/prediction/datasets/mutagenesis.yml rename to experiments/json2vec/datasets/mutagenesis.yml diff --git a/experiments/prediction/datasets/optdigits.yml b/experiments/json2vec/datasets/optdigits.yml similarity index 100% rename from experiments/prediction/datasets/optdigits.yml rename to experiments/json2vec/datasets/optdigits.yml diff --git a/experiments/prediction/datasets/phishing.yml b/experiments/json2vec/datasets/phishing.yml similarity index 100% rename from experiments/prediction/datasets/phishing.yml rename to experiments/json2vec/datasets/phishing.yml diff --git a/experiments/prediction/datasets/semeion.yml b/experiments/json2vec/datasets/semeion.yml similarity index 
100% rename from experiments/prediction/datasets/semeion.yml rename to experiments/json2vec/datasets/semeion.yml diff --git a/experiments/prediction/datasets/sick.yml b/experiments/json2vec/datasets/sick.yml similarity index 100% rename from experiments/prediction/datasets/sick.yml rename to experiments/json2vec/datasets/sick.yml diff --git a/experiments/prediction/datasets/splice.yml b/experiments/json2vec/datasets/splice.yml similarity index 100% rename from experiments/prediction/datasets/splice.yml rename to experiments/json2vec/datasets/splice.yml diff --git a/experiments/prediction/datasets/tictactoe.yml b/experiments/json2vec/datasets/tictactoe.yml similarity index 100% rename from experiments/prediction/datasets/tictactoe.yml rename to experiments/json2vec/datasets/tictactoe.yml diff --git a/experiments/prediction/flags.yml b/experiments/json2vec/flags.yml similarity index 100% rename from experiments/prediction/flags.yml rename to experiments/json2vec/flags.yml diff --git a/experiments/prediction/guild.yml b/experiments/json2vec/guild.yml similarity index 100% rename from experiments/prediction/guild.yml rename to experiments/json2vec/guild.yml diff --git a/experiments/prediction/run_baseline.py b/experiments/json2vec/run_baseline.py similarity index 100% rename from experiments/prediction/run_baseline.py rename to experiments/json2vec/run_baseline.py diff --git a/experiments/prediction/run_lightgbm.py b/experiments/json2vec/run_lightgbm.py similarity index 100% rename from experiments/prediction/run_lightgbm.py rename to experiments/json2vec/run_lightgbm.py diff --git a/experiments/prediction/run_logreg.py b/experiments/json2vec/run_logreg.py similarity index 100% rename from experiments/prediction/run_logreg.py rename to experiments/json2vec/run_logreg.py diff --git a/experiments/prediction/run_origami.py b/experiments/json2vec/run_origami.py similarity index 100% rename from experiments/prediction/run_origami.py rename to 
experiments/json2vec/run_origami.py diff --git a/experiments/prediction/run_rf.py b/experiments/json2vec/run_rf.py similarity index 100% rename from experiments/prediction/run_rf.py rename to experiments/json2vec/run_rf.py diff --git a/experiments/prediction/run_xgboost.py b/experiments/json2vec/run_xgboost.py similarity index 100% rename from experiments/prediction/run_xgboost.py rename to experiments/json2vec/run_xgboost.py diff --git a/experiments/prediction/runner.py b/experiments/json2vec/runner.py similarity index 100% rename from experiments/prediction/runner.py rename to experiments/json2vec/runner.py diff --git a/experiments/prediction/utils.py b/experiments/json2vec/utils.py similarity index 100% rename from experiments/prediction/utils.py rename to experiments/json2vec/utils.py From 48eac54e0adf6c5b071384bbbc17426ec53f8cf2 Mon Sep 17 00:00:00 2001 From: Thomas Rueckstiess Date: Mon, 3 Feb 2025 15:52:42 +1100 Subject: [PATCH 2/9] ddxplus experiments, currently fails in run_origami.py:164. 
--- experiments/ddxplus/.env.local | 2 + experiments/ddxplus/README.md | 75 ++++ experiments/ddxplus/__init__.py | 0 experiments/ddxplus/baseline.ipynb | 420 +++++++++++++++++++++ experiments/ddxplus/baseline.py | 57 +++ experiments/ddxplus/collect_results.ipynb | 202 ++++++++++ experiments/ddxplus/guild.yml | 428 ++++++++++++++++++++++ experiments/ddxplus/run_origami.py | 181 +++++++++ experiments/ddxplus/utils.py | 56 +++ 9 files changed, 1421 insertions(+) create mode 100644 experiments/ddxplus/.env.local create mode 100644 experiments/ddxplus/README.md create mode 100644 experiments/ddxplus/__init__.py create mode 100644 experiments/ddxplus/baseline.ipynb create mode 100644 experiments/ddxplus/baseline.py create mode 100644 experiments/ddxplus/collect_results.ipynb create mode 100644 experiments/ddxplus/guild.yml create mode 100644 experiments/ddxplus/run_origami.py create mode 100644 experiments/ddxplus/utils.py diff --git a/experiments/ddxplus/.env.local b/experiments/ddxplus/.env.local new file mode 100644 index 0000000..9d63cbd --- /dev/null +++ b/experiments/ddxplus/.env.local @@ -0,0 +1,2 @@ +MONGO_URI="mongodb://localhost:27017" +DATABASE=ddxplus \ No newline at end of file diff --git a/experiments/ddxplus/README.md b/experiments/ddxplus/README.md new file mode 100644 index 0000000..1504062 --- /dev/null +++ b/experiments/ddxplus/README.md @@ -0,0 +1,75 @@ +# DDXPlus Experiments + +In this experiment, we train the model on the [DDXPlus dataset](https://arxiv.org/abs/2205.09148), a dataset for automated medical diagnosis. The task is to predict the most likely differential diagnoses for each instance, which is a multi-label prediction task. + +For ORIGAMI, we reformat the dataset into JSON with two different representations: + +- A flat representation, in which we store the evidences and their values as strings. +- An object representation, where the evidences are stored as objects containing array values. 
+ +We compare our model against the following baselines: Logistic Regression, Random Forests, XGBoost, and LightGBM. The baselines are trained on a +flat representation by converting the evidence-value strings into a multi-label binary matrix. We wrap each model in a scikit-learn +[MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html). + +First, make sure you have restored the datasets from the MongoDB dump file as described in [../EXPERIMENTS.md](../EXPERIMENTS.md). All commands (see below) must be run from the `ddxplus` directory. + +## ORiGAMi + +## Baselines + +### Single Run + +To run a single model with the settings from `guild.yml`: + +```bash +guild run lr:hyperopt lr_C=10.0 +``` + +**Note**: parameters passed through the CLI override the values in the notebook and in `guild.yml`. + +It is also possible to run the notebook directly. However, guild then tracks a number of additional quantities +(e.g. `TARGET_FIELD`) that are usually not of interest, so this method is suitable for quick local checks only. 
+ +```bash +guild run baseline.ipynb model_name=LogisticRegression lr_C=0.1 +``` + +### Experimental Runs + +First perform HPO, supplying `limit=0` and the appropriate number of `--max-trials`: + +```bash +guild run lr:hyperopt limit=1000 --optimizer random --max-trials 3 +``` + +Once the optimal hyperparameters are found: + +- update the `prod` sections in the `guild.yml` file in this folder with the optimal hyperparameters values +- run the model with the optimal hyperparameters, 5 repetitions with 5 different seeds: + +```bash +guild run lr:prod +``` + +## Retrieving Results + +To retrieve the values for the individual runs, we have 2 alternatives: + +```python +from axon.utils.guild import get_runs +from guild import ipy, tfevent + +# 1st alternative +runs = get_runs() +run = runs[0] + +for _path, _digest, scalars in tfevent.scalar_readers(run.dir): + for tag, value, step in scalars: + print(tag, value, step) + +# 2nd alternative (as pandas dataframe) +runs = ipy.runs() +sd_df = runs.scalars_detail() +sd_df['run_id'] = sd_df['run'].apply(lambda x: x.id) +print(sd_df[sd_df['run_id'] == run.id]) +``` diff --git a/experiments/ddxplus/__init__.py b/experiments/ddxplus/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/experiments/ddxplus/baseline.ipynb b/experiments/ddxplus/baseline.ipynb new file mode 100644 index 0000000..f9650cb --- /dev/null +++ b/experiments/ddxplus/baseline.ipynb @@ -0,0 +1,420 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "e5fb0cddd6a59264", + "metadata": { + "jupyter": { + "is_executing": true + } + }, + "outputs": [], + "source": [ + "from collections import defaultdict\n", + "import warnings\n", + "\n", + "import numpy as np\n", + "from lightgbm import LGBMClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.exceptions import ConvergenceWarning\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.multioutput import 
MultiOutputClassifier\n", + "from sklearn.preprocessing import MultiLabelBinarizer\n", + "from xgboost import XGBClassifier\n", + "\n", + "from axon.gpt.data import load_df_from_mongodb\n", + "from axon.utils.guild import load_secrets, print_guild_scalars\n", + "from utils import get_scores" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "cf864690a3f51d36", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Running model_name='LogisticRegression', limit=20000, n_random_seeds=5\n" + ] + } + ], + "source": [ + "# experiment flags\n", + "model_name = \"LogisticRegression\" # \"XGBoost\" # \"RandomForest\"\n", + "limit = 1000\n", + "n_random_seeds = 5\n", + "\n", + "print(f\"Running {model_name=}, {limit=}, {n_random_seeds=}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 127, + "id": "2827794b", + "metadata": {}, + "outputs": [], + "source": [ + "# default model hyperparameters\n", + "\n", + "# logistic regression\n", + "lr_C = 1.0\n", + "lr_penalty = \"none\"\n", + "lr_max_iter = 50\n", + "lr_fit_intercept = True\n", + "\n", + "# xgboost\n", + "xgb_learning_rate = 0.1\n", + "xgb_max_depth = 5\n", + "xgb_subsample = 1.0\n", + "xgb_colsample_bytree = 1.0\n", + "xgb_colsample_bylevel = 1.0\n", + "xgb_min_child_weight = 1.0\n", + "xgb_reg_alpha = 0.0\n", + "xgb_reg_lambda = 1.0\n", + "xgb_gamma = 0\n", + "xgb_n_estimators = 100\n", + "\n", + "# random forest\n", + "rf_n_estimators = 100\n", + "rf_max_features = \"none\"\n", + "rf_max_depth = \"none\"\n", + "rf_min_samples_split = 5\n", + "\n", + "# lightgbm\n", + "lgb_num_leaves = 10\n", + "lgb_max_depth = 5\n", + "lgb_learning_rate = 0.1\n", + "lgb_n_estimators = 100\n", + "lgb_min_child_weight = 1.0\n", + "lgb_subsample = 0.8\n", + "lgb_colsample_bytree = 0.8\n", + "lgb_reg_alpha = 0.0\n", + "lgb_reg_lambda = 1.0" + ] + }, + { + "cell_type": "code", + "execution_count": 128, + "id": "ca4fafa10c64b8dc", + "metadata": {}, + "outputs": [ + 
{ + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "loading local secrets.\n", + "\n" + ] + } + ], + "source": [ + "secrets = load_secrets()" + ] + }, + { + "cell_type": "markdown", + "id": "9b137e20", + "metadata": {}, + "source": [ + "# Data" + ] + }, + { + "cell_type": "code", + "execution_count": 129, + "id": "55de59c8", + "metadata": {}, + "outputs": [], + "source": [ + "PROJECTION = {\"_id\": 0, \"DIFFERENTIAL_DIAGNOSIS\": 0}\n", + "TARGET_FIELD = \"DIFFERENTIAL_DIAGNOSIS_NOPROB\"\n", + "\n", + "\n", + "def load_docs(collection_name):\n", + " return load_df_from_mongodb(\n", + " uri=secrets[\"MONGO_URI\"],\n", + " db=secrets[\"DATABASE\"],\n", + " coll=collection_name,\n", + " projection=PROJECTION,\n", + " sort=[(\"_id\", 1)],\n", + " limit=limit\n", + " )\n", + "\n", + "\n", + "def preprocess_dataset(df):\n", + " # pull up relevant fields at the top of the df\n", + " df[\"EVIDENCES\"] = df[\"docs\"].apply(lambda x: x[\"EVIDENCES\"])\n", + " df[\"DIFFERENTIAL_DIAGNOSIS_NOPROB\"] = df[\"docs\"].apply(lambda x: x[\"DIFFERENTIAL_DIAGNOSIS_NOPROB\"])\n", + " df[\"PATHOLOGY\"] = df[\"docs\"].apply(lambda x: x[\"PATHOLOGY\"])\n", + " return df" + ] + }, + { + "cell_type": "code", + "execution_count": 130, + "id": "b0a33a03", + "metadata": {}, + "outputs": [], + "source": [ + "# load data\n", + "\n", + "train_docs_df = load_docs(collection_name=\"train-noprob\").pipe(preprocess_dataset)\n", + "test_docs_df = load_docs(collection_name=\"test-noprob\").pipe(preprocess_dataset)\n", + "val_docs_df = load_docs(collection_name=\"validate-noprob\").pipe(preprocess_dataset)" + ] + }, + { + "cell_type": "markdown", + "id": "a0958ddb", + "metadata": {}, + "source": [ + "# ML" + ] + }, + { + "cell_type": "code", + "execution_count": 132, + "id": "e6d5e978", + "metadata": {}, + "outputs": [], + "source": [ + "def get_classifier(model_name, seed):\n", + " match model_name:\n", + " case \"LogisticRegression\":\n", + " clf = LogisticRegression(\n", + " 
random_state=seed,\n", + " C=lr_C if lr_penalty != \"none\" else 1.0,\n", + " penalty=lr_penalty if lr_penalty != \"none\" else None,\n", + " max_iter=lr_max_iter,\n", + " fit_intercept=True if lr_fit_intercept == 1 else False,\n", + " solver=\"saga\"\n", + " )\n", + " case \"XGBoost\":\n", + " clf = XGBClassifier(\n", + " random_state=seed,\n", + " max_depth=xgb_max_depth,\n", + " learning_rate=xgb_learning_rate,\n", + " n_estimators=xgb_n_estimators,\n", + " subsample=xgb_subsample,\n", + " colsample_bytree=xgb_colsample_bytree,\n", + " colsample_bylevel=xgb_colsample_bylevel,\n", + " min_child_weight=xgb_min_child_weight,\n", + " reg_alpha=xgb_reg_alpha,\n", + " reg_lambda=xgb_reg_lambda,\n", + " gamma=xgb_gamma,\n", + " )\n", + " case \"RandomForest\":\n", + " clf = RandomForestClassifier(\n", + " random_state=seed,\n", + " n_estimators=rf_n_estimators,\n", + " max_features=rf_max_features if rf_max_features != \"none\" else None,\n", + " max_depth=rf_max_depth if rf_max_depth != \"none\" else None,\n", + " min_samples_split=rf_min_samples_split\n", + " )\n", + " case \"LightGBM\":\n", + " clf = LGBMClassifier(\n", + " random_state=seed,\n", + " verbose=-1,\n", + " num_leaves=lgb_num_leaves,\n", + " max_depth=lgb_max_depth,\n", + " learning_rate=lgb_learning_rate,\n", + " n_estimators=lgb_n_estimators,\n", + " min_child_weight=lgb_min_child_weight,\n", + " subsample=lgb_subsample,\n", + " colsample_bytree=lgb_colsample_bytree,\n", + " reg_alpha=lgb_reg_alpha,\n", + " reg_lambda=lgb_reg_lambda,\n", + " )\n", + "\n", + " case _:\n", + " raise ValueError(f\"Unknown model {model_name}\")\n", + " return clf" + ] + }, + { + "cell_type": "code", + "execution_count": 133, + "id": "98bd9c12", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\RobinVujanic\\venvs\\axon\\Lib\\site-packages\\sklearn\\preprocessing\\_label.py:900: UserWarning: unknown class(es) ['E_152_@_V_132', 'E_55_@_V_136', 'E_55_@_V_178'] will 
be ignored\n", + " warnings.warn(\n", + "C:\\Users\\RobinVujanic\\venvs\\axon\\Lib\\site-packages\\sklearn\\preprocessing\\_label.py:900: UserWarning: unknown class(es) ['E_133_@_V_162', 'E_55_@_V_136', 'E_55_@_V_178', 'E_55_@_V_45'] will be ignored\n", + " warnings.warn(\n" + ] + } + ], + "source": [ + "# encode data\n", + "mlb_ddx = MultiLabelBinarizer()\n", + "mlb_evd = MultiLabelBinarizer()\n", + "\n", + "# train\n", + "X_train = mlb_evd.fit_transform(train_docs_df[\"EVIDENCES\"])\n", + "y_train = mlb_ddx.fit_transform(train_docs_df[\"DIFFERENTIAL_DIAGNOSIS_NOPROB\"])\n", + "\n", + "# val\n", + "X_val = mlb_evd.transform(val_docs_df[\"EVIDENCES\"])\n", + "y_val = mlb_ddx.transform(val_docs_df[\"DIFFERENTIAL_DIAGNOSIS_NOPROB\"])\n", + "y_pathology_val = mlb_ddx.transform(val_docs_df[\"PATHOLOGY\"].apply(lambda x: [x, ]))\n", + "y_pathology_val = np.where(y_pathology_val > 0.5)[1]\n", + "\n", + "# test\n", + "X_test = mlb_evd.transform(test_docs_df[\"EVIDENCES\"])\n", + "y_test = mlb_ddx.transform(test_docs_df[\"DIFFERENTIAL_DIAGNOSIS_NOPROB\"])\n", + "y_pathology_test = mlb_ddx.transform(test_docs_df[\"PATHOLOGY\"].apply(lambda x: [x, ]))\n", + "y_pathology_test = np.where(y_pathology_test > 0.5)[1]" + ] + }, + { + "cell_type": "code", + "execution_count": 134, + "id": "242ff782", + "metadata": {}, + "outputs": [], + "source": [ + "# X_test[:3,:], y_test[:3,:], y_pathology_test[:3]" + ] + }, + { + "cell_type": "code", + "execution_count": 135, + "id": "d46e57e7", + "metadata": {}, + "outputs": [], + "source": [ + "results = defaultdict(list)" + ] + }, + { + "cell_type": "code", + "execution_count": 136, + "id": "a152877e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training LogisticRegression(max_iter=10, penalty='l1', random_state=0,\n", + " solver='liblinear')\n", + "| step: 0 | recall_val: 0.9004263372938938 | precision_val: 0.8870666039515851 | f1_val: 0.8936965452577607 | gtpa_val: 0.99585 | 
gtpa_at_1_val: 0.69515 | recall_test: 0.9008239804993183 | precision_test: 0.8882346888419146 | f1_test: 0.8944850403534006 | gtpa_test: 0.9952 | gtpa_at_1_test: 0.69995 |\n", + "Training LogisticRegression(max_iter=10, penalty='l1', random_state=1,\n", + " solver='liblinear')\n", + "| step: 1 | recall_val: 0.9006822054160886 | precision_val: 0.8869225825533802 | f1_val: 0.8937494384258217 | gtpa_val: 0.99585 | gtpa_at_1_val: 0.6902 | recall_test: 0.9011261178939474 | precision_test: 0.8880576288652519 | f1_test: 0.8945441461951216 | gtpa_test: 0.9953 | gtpa_at_1_test: 0.6949 |\n", + "Training LogisticRegression(max_iter=10, penalty='l1', random_state=2,\n", + " solver='liblinear')\n", + "| step: 2 | recall_val: 0.9007511141601579 | precision_val: 0.8871587858311466 | f1_val: 0.8939032831332995 | gtpa_val: 0.99595 | gtpa_at_1_val: 0.6956 | recall_test: 0.9011501989082487 | precision_test: 0.8884827213705088 | f1_test: 0.8947716283234932 | gtpa_test: 0.9953 | gtpa_at_1_test: 0.69845 |\n", + "Training LogisticRegression(max_iter=10, penalty='l1', random_state=3,\n", + " solver='liblinear')\n", + "| step: 3 | recall_val: 0.9000891652192525 | precision_val: 0.8869543302750852 | f1_val: 0.8934734769889457 | gtpa_val: 0.99595 | gtpa_at_1_val: 0.68845 | recall_test: 0.9007773470981861 | precision_test: 0.8884049064098483 | f1_test: 0.8945483481918315 | gtpa_test: 0.9953 | gtpa_at_1_test: 0.6939 |\n", + "Training LogisticRegression(max_iter=10, penalty='l1', random_state=4,\n", + " solver='liblinear')\n", + "| step: 4 | recall_val: 0.9006251930612028 | precision_val: 0.8868180759803402 | f1_val: 0.8936683079382203 | gtpa_val: 0.9959 | gtpa_at_1_val: 0.6902 | recall_test: 0.9010830305397445 | precision_test: 0.8886362504525425 | f1_test: 0.8948163593133946 | gtpa_test: 0.9954 | gtpa_at_1_test: 0.69505 |\n", + "CPU times: total: 18.2 s\n", + "Wall time: 29.4 s\n" + ] + } + ], + "source": [ + "for clf_seed in range(n_random_seeds):\n", + " clf = 
get_classifier(model_name=model_name, seed=clf_seed)\n", + " multi_output_clf = MultiOutputClassifier(clf, n_jobs=4)\n", + " print(f'Training {clf}')\n", + "\n", + " # train\n", + " with warnings.catch_warnings():\n", + " warnings.simplefilter(action='ignore', category=ConvergenceWarning)\n", + " multi_output_clf.fit(X_train, y_train)\n", + "\n", + " # evaluate dev\n", + " y_pred_val = multi_output_clf.predict_proba(X_val)\n", + " y_pred_val = np.hstack([y_pred_val_i[:, 1].reshape(-1, 1) for y_pred_val_i in y_pred_val])\n", + "\n", + " scores_val = get_scores(y_target=y_val, y_pred=y_pred_val, y_pathology=y_pathology_val, postfix=\"_val\")\n", + " for score_name, score in scores_val.items():\n", + " results[score_name].append(score)\n", + "\n", + " # evaluate test\n", + " y_pred_test = multi_output_clf.predict_proba(X_test)\n", + " y_pred_test = np.hstack([y_pred_test_i[:, 1].reshape(-1, 1) for y_pred_test_i in y_pred_test])\n", + "\n", + " scores_test = get_scores(y_target=y_test, y_pred=y_pred_test, y_pathology=y_pathology_test, postfix=\"_test\")\n", + " for score_name, score in scores_test.items():\n", + " results[score_name].append(score)\n", + "\n", + " guild_output = {\"step\": clf_seed} | scores_val | scores_test\n", + " print_guild_scalars(**guild_output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "93dd715f", + "metadata": {}, + "outputs": [], + "source": [ + "# print('Individual fold metrics:')\n", + "# print_guild_scalars(**results['val'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e3f81a21f74ed13", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Aggregated metrics:\")\n", + "keys = list(results.keys())\n", + "scalars = {}\n", + "for key in keys:\n", + " scalars[f\"{key}_mean\"] = np.mean(results[key])\n", + " scalars[f\"{key}_std\"] = np.std(results[key])\n", + " scalars[f\"{key}_min\"] = np.min(results[key])\n", + " scalars[f\"{key}_max\"] = np.max(results[key])\n", + "\n", + "# 
print rounded scalars\n", + "print_guild_scalars(**{k: f\"{v:.4f}\" for k, v in scalars.items()})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "adb191c9", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/experiments/ddxplus/baseline.py b/experiments/ddxplus/baseline.py new file mode 100644 index 0000000..9c923cc --- /dev/null +++ b/experiments/ddxplus/baseline.py @@ -0,0 +1,57 @@ +from types import SimpleNamespace + +from pymongo import MongoClient +from sklearn.model_selection import StratifiedKFold + +from axon.gpt.data import load_df_from_mongodb +from axon.gpt.utils import set_seed +from axon.utils.guild import load_secrets + +flags = SimpleNamespace() +secrets = load_secrets() + +set_seed(flags.seed) + +# load PATHOLOGY fields for stratified cv splits +client = MongoClient(secrets["MONGO_URI"]) +collection = client.ddxplus["train-noprob"] + +pathologies = [ + d["PATHOLOGY"] for d in collection.find({}, projection={"PATHOLOGY": 1}, limit=flags.limit, sort=[("_id", 1)]) +] + +# now load data properly for training, same sort order +PROJECTION = {"_id": 0, "PATHOLOGY": 0, "DIFFERENTIAL_DIAGNOSIS": 0} +TARGET_FIELD = "DIFFERENTIAL_DIAGNOSIS_NOPROB" + +docs_df = load_df_from_mongodb( + secrets["MONGO_URI"], "ddxplus", "train-noprob", projection=PROJECTION, limit=flags.limit, sort=[("_id", 1)] +) + +cv_scores = [] + +# create cross-validation splits +kfold = StratifiedKFold(n_splits=flags.n_cv_splits, shuffle=True, random_state=flags.seed) +splits = list(kfold.split(docs_df, pathologies)) +splits = 
[(train.tolist(), test.tolist()) for train, test in splits] + +for k, (train_ixs, test_ixs) in enumerate(splits): + pass + # TODO train models + + # print results for this fold + # print_guild_scalars(fold=k, ddr=ddr, ddp=ddp, f1=f1, gtpa_at_1=gtpa_at_1, gtpa=gtpa) + # cv_scores.append({"ddr": ddr, "ddp": ddp, "f1": f1, "gtpa_at_1": gtpa_at_1, "gtpa": gtpa}) + + +# print("cross-validation results:") +# keys = list(cv_scores[0].keys()) +# scalars = {} +# for key in keys: +# scalars[f"{key}_mean"] = np.mean([e[key] for e in cv_scores]) +# scalars[f"{key}_std"] = np.std([e[key] for e in cv_scores]) +# scalars[f"{key}_min"] = np.min([e[key] for e in cv_scores]) +# scalars[f"{key}_max"] = np.max([e[key] for e in cv_scores]) + +# print rounded scalars +# print_guild_scalars(**{k: f"{v:.4f}" for k, v in scalars.items()}) diff --git a/experiments/ddxplus/collect_results.ipynb b/experiments/ddxplus/collect_results.ipynb new file mode 100644 index 0000000..68267d4 --- /dev/null +++ b/experiments/ddxplus/collect_results.ipynb @@ -0,0 +1,202 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/tr/code/python/axon/.venv/lib/python3.10/site-packages/guild/ipy.py:207: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n", + " return [row[1][0].value for row in self.iterrows()]\n" + ] + }, + { + "data": { + "text/html": [ + "
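<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th>tag</th>\n", + "      <th>ddp</th>\n", + "      <th>ddr</th>\n", + "      <th>f1</th>\n", + "      <th>gtpa</th>\n", + "      <th>gtpa_at_1</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n", + "      <th>0</th>\n", + "      <td>0.959870</td>\n", + "      <td>0.968872</td>\n", + "      <td>0.964350</td>\n", + "      <td>0.998127</td>\n", + "      <td>0.741030</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1</th>\n", + "      <td>0.956066</td>\n", + "      <td>0.970881</td>\n", + "      <td>0.963416</td>\n", + "      <td>0.997904</td>\n", + "      <td>0.740837</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>2</th>\n", + "      <td>0.959147</td>\n", + "      <td>0.967520</td>\n", + "      <td>0.963315</td>\n", + "      <td>0.997718</td>\n", + "      <td>0.739372</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>3</th>\n", + "      <td>0.956813</td>\n", + "      <td>0.969749</td>\n", + "      <td>0.963238</td>\n", + "      <td>0.997919</td>\n", + "      <td>0.740941</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>4</th>\n", + "      <td>0.957791</td>\n", + "      <td>0.968451</td>\n", + "      <td>0.963092</td>\n", + "      <td>0.997867</td>\n", + "      <td>0.741364</td>\n", + "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>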
" + ], + "text/plain": [ + "tag ddp ddr f1 gtpa gtpa_at_1\n", + "0 0.959870 0.968872 0.964350 0.998127 0.741030\n", + "1 0.956066 0.970881 0.963416 0.997904 0.740837\n", + "2 0.959147 0.967520 0.963315 0.997718 0.739372\n", + "3 0.956813 0.969749 0.963238 0.997919 0.740941\n", + "4 0.957791 0.968451 0.963092 0.997867 0.741364" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
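<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th>tag</th>\n", + "      <th>ddp</th>\n", + "      <th>ddr</th>\n", + "      <th>f1</th>\n", + "      <th>gtpa</th>\n", + "      <th>gtpa_at_1</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n", + "      <th>mean</th>\n", + "      <td>95.793754</td>\n", + "      <td>96.909465</td>\n", + "      <td>96.348227</td>\n", + "      <td>99.790678</td>\n", + "      <td>74.070870</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>std</th>\n", + "      <td>0.158019</td>\n", + "      <td>0.128096</td>\n", + "      <td>0.049950</td>\n", + "      <td>0.014647</td>\n", + "      <td>0.077287</td>\n", + "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>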
" + ], + "text/plain": [ + "tag ddp ddr f1 gtpa gtpa_at_1\n", + "mean 95.793754 96.909465 96.348227 99.790678 74.070870\n", + "std 0.158019 0.128096 0.049950 0.014647 0.077287" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import guild.ipy as guild\n", + "\n", + "runs = guild.runs(labels=[\"ddxplus-11-eval-test\"], completed=True)\n", + "scalars = runs.scalars()\n", + "scalars\n", + "\n", + "# only keep rows where tag is one of [\"ddp\", \"ddr\", \"f1\", \"gtpa\", \"gtpa_at_1\"]\n", + "scalars = scalars[scalars[\"tag\"].isin([\"ddp\", \"ddr\", \"f1\", \"gtpa\", \"gtpa_at_1\"])]\n", + "\n", + "scalars = scalars.pivot(index=\"run\", columns=\"tag\", values=\"first_val\")\n", + "scalars = scalars.sort_values(by=\"f1\", ascending=False, ignore_index=True)\n", + "\n", + "display(scalars)\n", + "results = scalars.aggregate([\"mean\", \"std\"])\n", + "\n", + "results" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.14" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/experiments/ddxplus/guild.yml b/experiments/ddxplus/guild.yml new file mode 100644 index 0000000..8d6f2e1 --- /dev/null +++ b/experiments/ddxplus/guild.yml @@ -0,0 +1,428 @@ +- model: origami + operations: + train: + main: run_origami + flags-dest: namespace:flags + flags: + model_size: + default: medium + choices: [xs, small, medium, large, xl] + seed: 1234 + n_batches: 33000 + eval_data: + default: validate + choices: [validate, test] + limit: 0 + verbose: False + + requires: + - file: .env.local + - file: .env.remote + + # matches the guild_output_scalars() helper function + output-scalars: + - step: '\| step: 
(\step)' + - '\| (\key): (\value)' + +- config: shared-flags + flags: + limit: + default: 0 + nb-replace: 'limit = (\d+)' + type: int + +- model: lr + operations: + hyperopt: + description: "hyper-parameter tuning of LogisticRegression baseline" + notebook: baseline.ipynb + flags: + $include: shared-flags + model_name: LogisticRegression + n_random_seeds: + default: 1 + nb-replace: 'n_random_seeds = (\d+)' + lr_penalty: + choices: ["l1", "l2", "none"] + nb-replace: 'lr_penalty = (\w+)' + lr_max_iter: + choices: [10, 50, 100, 300, 500, 1000, 5000] + nb-replace: 'lr_max_iter = (\d+)' + type: int + lr_fit_intercept: + choices: [True, False] + nb-replace: 'lr_fit_intercept = (\w+)' + lr_C: + choices: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5] + nb-replace: 'lr_C = ([\d\.e-]+)' + type: float + + output-scalars: + - step: '\| step: (\step)' + - '\| (\key): (\value)' + + prod: + description: "run LogisticRegression model with optimal hyperparams" + notebook: baseline.ipynb + flags: + $include: shared-flags + model_name: LogisticRegression + n_random_seeds: + default: 5 + nb-replace: 'n_random_seeds = (\d+)' + lr_penalty: + default: "CHANGE HERE" + nb-replace: 'lr_penalty = (\w+)' + lr_max_iter: + default: 0 + nb-replace: 'lr_max_iter = (\d+)' + type: int + lr_fit_intercept: + default: "CHANGE HERE" + nb-replace: 'lr_fit_intercept = (\w+)' + lr_C: + default: 0 + nb-replace: 'lr_C = ([\d\.e-]+)' + type: float + + output-scalars: + - step: '\| step: (\step)' + - '\| (\key): (\value)' + +- model: rf + operations: + hyperopt: + description: "hyper-parameter tuning of RandomForest baseline" + notebook: baseline.ipynb + flags: + $include: shared-flags + model_name: RandomForest + n_random_seeds: + default: 1 + nb-replace: 'n_random_seeds = (\d+)' + rf_n_estimators: + choices: [20, 50, 100, 150, 200] + nb-replace: 'rf_n_estimators = (\d+)' + type: int + rf_max_features: + choices: ["log2", "sqrt", "none"] + nb-replace: 'rf_max_features = (\w+)' + rf_max_depth: + 
choices: [0, 1, 5, 10, 20, 30, 45, "none"] + nb-replace: 'rf_max_depth = (\d+)' + rf_min_samples_split: + choices: [5, 10] + nb-replace: 'rf_min_samples_split = (\d+)' + type: int + + output-scalars: + - step: '\| step: (\step)' + - '\| (\key): (\value)' + + prod: + description: "run RandomForest model with optimal hyperparams" + notebook: baseline.ipynb + flags: + $include: shared-flags + model_name: RandomForest + n_random_seeds: + default: 5 + nb-replace: 'n_random_seeds = (\d+)' + rf_n_estimators: + default: 0 + nb-replace: 'rf_n_estimators = (\d+)' + type: int + rf_max_features: + default: "CHANGE HERE" + nb-replace: 'rf_max_features = (\w+)' + rf_max_depth: + default: 0 + nb-replace: 'rf_max_depth = (\d+)' + rf_min_samples_split: + default: 0 + nb-replace: 'rf_min_samples_split = (\d+)' + + output-scalars: + - step: '\| step: (\step)' + - '\| (\key): (\value)' + +- model: xgb + operations: + hyperopt: + description: "hyper-parameter tuning of XGBoost baseline" + notebook: baseline.ipynb + flags: + $include: shared-flags + model_name: XGBoost + n_random_seeds: + default: 1 + nb-replace: 'n_random_seeds = (\d+)' + xgb_learning_rate: + choices: [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0] + nb-replace: 'xgb_learning_rate = ([\d\.e-]+)' + type: float + xgb_max_depth: + choices: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] + nb-replace: 'xgb_max_depth = (\d+)' + type: int + xgb_subsample: + choices: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] + nb-replace: 'xgb_subsample = ([\d\.]+)' + type: float + xgb_colsample_bytree: + choices: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] + nb-replace: 'xgb_colsample_bytree = ([\d\.]+)' + type: float + xgb_colsample_bylevel: + choices: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] + nb-replace: 'xgb_colsample_bylevel = ([\d\.]+)' + type: float + xgb_min_child_weight: + choices: + [ + 1e-16, + 1e-15, + 1e-14, + 1e-13, + 1e-12, + 1e-11, + 1e-10, + 1e-9, + 1e-8, + 1e-7, + 1e-6, + 1e-5, + 1e-4, + 1e-3, + 1e-2, + 1e-1, + 1.0, + 1e1, + 
1e2, + 1e3, + 1e4, + 1e5, + ] + nb-replace: 'xgb_min_child_weight = ([\d\.e-]+)' + type: float + xgb_reg_alpha: + choices: + [ + 1e-16, + 1e-15, + 1e-14, + 1e-13, + 1e-12, + 1e-11, + 1e-10, + 1e-9, + 1e-8, + 1e-7, + 1e-6, + 1e-5, + 1e-4, + 1e-3, + 1e-2, + 1e-1, + 1.0, + 1e1, + 1e2, + ] + nb-replace: 'xgb_reg_alpha = ([\d\.e-]+)' + type: float + xgb_reg_lambda: + choices: + [ + 1e-16, + 1e-15, + 1e-14, + 1e-13, + 1e-12, + 1e-11, + 1e-10, + 1e-9, + 1e-8, + 1e-7, + 1e-6, + 1e-5, + 1e-4, + 1e-3, + 1e-2, + 1e-1, + 1.0, + 1e1, + 1e2, + ] + nb-replace: 'xgb_reg_lambda = ([\d\.e-]+)' + type: float + xgb_gamma: + choices: + [ + 1e-16, + 1e-15, + 1e-14, + 1e-13, + 1e-12, + 1e-11, + 1e-10, + 1e-9, + 1e-8, + 1e-7, + 1e-6, + 1e-5, + 1e-4, + 1e-3, + 1e-2, + 1e-1, + 1.0, + 1e1, + 1e2, + ] + nb-replace: 'xgb_gamma = ([\d\.e-]+)' + type: float + xgb_n_estimators: + choices: [100, 200, 500, 1000, 1500, 2000, 3000, 4000, 5000] + nb-replace: 'xgb_n_estimators = (\d+)' + type: int + + output-scalars: + - step: '\| step: (\step)' + - '\| (\key): (\value)' + + prod: + description: "run XGBoost model with optimal hyperparams" + notebook: baseline.ipynb + flags: + $include: shared-flags + model_name: XGBoost + n_random_seeds: + default: 5 + nb-replace: 'n_random_seeds = (\d+)' + xgb_learning_rate: + default: 0 + nb-replace: 'xgb_learning_rate = ([\d\.e-]+)' + xgb_max_depth: + default: 0 + nb-replace: 'xgb_max_depth = (\d+)' + xgb_subsample: + default: 0 + nb-replace: 'xgb_subsample = ([\d\.]+)' + xgb_colsample_bytree: + default: 0 + nb-replace: 'xgb_colsample_bytree = ([\d\.]+)' + xgb_colsample_bylevel: + default: 0 + nb-replace: 'xgb_colsample_bylevel = ([\d\.]+)' + xgb_min_child_weight: + default: 0 + nb-replace: 'xgb_min_child_weight = ([\d\.e-]+)' + xgb_reg_alpha: + default: 0 + nb-replace: 'xgb_reg_alpha = ([\d\.e-]+)' + xgb_reg_lambda: + default: 0 + nb-replace: 'xgb_reg_lambda = ([\d\.e-]+)' + xgb_gamma: + default: 0 + nb-replace: 'xgb_gamma = ([\d\.e-]+)' + xgb_n_estimators: + 
default: 0 + nb-replace: 'xgb_n_estimators = (\d+)' + + output-scalars: + - step: '\| step: (\step)' + - '\| (\key): (\value)' + +- model: lgb + operations: + hyperopt: + description: "hyper-parameter tuning of LightGBM baseline" + notebook: baseline.ipynb + flags: + $include: shared-flags + model_name: LightGBM + n_random_seeds: + default: 1 + nb-replace: 'n_random_seeds = (\d+)' + lgb_num_leaves: + choices: [5, 10, 20, 30, 40, 50] + nb-replace: 'lgb_num_leaves = (\d+)' + type: int + lgb_max_depth: + choices: + [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] + nb-replace: 'lgb_max_depth = (\d+)' + type: int + lgb_learning_rate: + choices: [1e-3, 1e-2, 1e-1, 1.0] + nb-replace: 'lgb_learning_rate = ([\d\.e-]+)' + type: float + lgb_n_estimators: + choices: [50, 100, 200, 500, 1000, 1500, 2000] + nb-replace: 'lgb_n_estimators = (\d+)' + type: int + lgb_min_child_weight: + choices: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3, 1e4] + nb-replace: 'lgb_min_child_weight = ([\d\.e-]+)' + type: float + lgb_subsample: + choices: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8] + nb-replace: 'lgb_subsample = ([\d\.]+)' + type: float + lgb_colsample_bytree: + choices: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8] + nb-replace: 'lgb_colsample_bytree = ([\d\.]+)' + type: float + lgb_reg_alpha: + choices: [0, 1e-1, 1, 2, 5, 7, 10, 50, 100] + nb-replace: 'lgb_reg_alpha = ([\d\.e-]+)' + lgb_reg_lambda: + choices: [0, 1e-1, 1, 2, 5, 7, 10, 50, 100] + nb-replace: 'lgb_reg_lambda = ([\d\.e-]+)' + + output-scalars: + - step: '\| step: (\step)' + - '\| (\key): (\value)' + + prod: + description: "run LightGBM model with optimal hyperparams" + notebook: baseline.ipynb + flags: + $include: shared-flags + model_name: LightGBM + n_random_seeds: + default: 5 + nb-replace: 'n_random_seeds = (\d+)' + lgb_num_leaves: + default: 0 + nb-replace: 'lgb_num_leaves = (\d+)' + lgb_max_depth: + default: 0 + nb-replace: 'lgb_max_depth = (\d+)' + type: int + lgb_learning_rate: + default: 0 + nb-replace: 
'lgb_learning_rate = ([\d\.e-]+)' + lgb_n_estimators: + default: 0 + nb-replace: 'lgb_n_estimators = (\d+)' + lgb_min_child_weight: + default: 0 + nb-replace: 'lgb_min_child_weight = ([\d\.e-]+)' + lgb_subsample: + default: 0 + nb-replace: 'lgb_subsample = ([\d\.]+)' + lgb_colsample_bytree: + default: 0 + nb-replace: 'lgb_colsample_bytree = ([\d\.]+)' + type: float + lgb_reg_alpha: + default: 0 + nb-replace: 'lgb_reg_alpha = ([\d\.e-]+)' + lgb_reg_lambda: + default: 0 + nb-replace: 'lgb_reg_lambda = ([\d\.e-]+)' + + output-scalars: + - step: '\| step: (\step)' + - '\| (\key): (\value)' diff --git a/experiments/ddxplus/run_origami.py b/experiments/ddxplus/run_origami.py new file mode 100644 index 0000000..dd99ef9 --- /dev/null +++ b/experiments/ddxplus/run_origami.py @@ -0,0 +1,181 @@ +from types import SimpleNamespace + +import numpy as np +from pymongo import MongoClient +from sklearn.pipeline import Pipeline + +from origami.inference import AutoCompleter, Metrics +from origami.model import ORIGAMI +from origami.model.vpda import ObjectVPDA +from origami.preprocessing import ( + DFDataset, + DocPermuterPipe, + DocTokenizerPipe, + KBinsDiscretizerPipe, + PadTruncTokensPipe, + TargetFieldPipe, + TokenEncoderPipe, + UpscalerPipe, + load_df_from_mongodb, +) +from origami.utils.common import set_seed +from origami.utils.config import ModelConfig, PositionEncodingMethod, TrainConfig +from origami.utils.guild import load_secrets, print_guild_scalars + +flags = SimpleNamespace() + +secrets = load_secrets() + +set_seed(flags.seed) + +# load PATHOLOGY fields for test data +client = MongoClient(secrets["MONGO_URI"]) +collection_test = client.ddxplus[f"{flags.eval_data}-noprob"] + +pathologies_test = [ + d["PATHOLOGY"] for d in collection_test.find({}, projection={"PATHOLOGY": 1}, limit=flags.limit, sort=[("_id", 1)]) +] + +# now load data for training and evaluation (test or validate), same sort order +PROJECTION = {"_id": 0, "PATHOLOGY": 0, "DIFFERENTIAL_DIAGNOSIS": 0} 
+TARGET_FIELD = "DIFFERENTIAL_DIAGNOSIS_NOPROB" + +train_docs_df = load_df_from_mongodb( + secrets["MONGO_URI"], "ddxplus", "train-noprob", projection=PROJECTION, limit=flags.limit, sort=[("_id", 1)] +) + +test_docs_df = load_df_from_mongodb( + secrets["MONGO_URI"], + "ddxplus", + f"{flags.eval_data}-noprob", + projection=PROJECTION, + limit=flags.limit, + sort=[("_id", 1)], +) + +# create train and test pipelines +pipes = { + # --- train only --- + "upscaler": UpscalerPipe(n=2), + "permuter": DocPermuterPipe(), + # --- test only --- + "target": TargetFieldPipe(TARGET_FIELD), + # --- train and test --- + "discretizer": KBinsDiscretizerPipe(bins=128, threshold=128, strategy="kmeans"), + "tokenizer": DocTokenizerPipe(), + "padding": PadTruncTokensPipe(length="max"), + "encoder": TokenEncoderPipe(), +} + +train_pipeline = Pipeline( + [(name, pipes[name]) for name in ("discretizer", "upscaler", "permuter", "tokenizer", "padding", "encoder")] +) +test_pipeline = Pipeline([(name, pipes[name]) for name in ("discretizer", "target", "tokenizer", "padding", "encoder")]) + +# process train and test/validation data +train_pipeline.fit(train_docs_df) +test_pipeline.fit(test_docs_df) +train_df = train_pipeline.transform(train_docs_df) +test_df = test_pipeline.transform(test_docs_df) + +# get stateful objects +encoder = pipes["encoder"].encoder +block_size = pipes["padding"].length + +# print data stats +print(f"len train: {len(train_df)}, len val: {len(test_df)}") +print(f"vocab size {encoder.vocab_size}") +print(f"block size {block_size}") + +# wrap in datasets +train_dataset = DFDataset(train_df) +test_dataset = DFDataset(test_df) + +# model and train configs +model_config = ModelConfig.from_preset(flags.model_size) +model_config.position_encoding = PositionEncodingMethod.KEY_VALUE +model_config.vocab_size = encoder.vocab_size +model_config.block_size = block_size +model_config.mask_field_token_losses = True + +train_config = TrainConfig() + +vpda = ObjectVPDA(encoder) # build 
VPDA without schema (only doc structure enforced) +model = ORIGAMI(model_config, train_config, vpda=vpda) + +metrics = Metrics(model) + + +def progress_callback(model): + global test_dataset + if model.batch_num % train_config.eval_every == 0: + print_guild_scalars( + step=f"{int(model.batch_num / train_config.eval_every)}", + epoch=model.epoch_num, + batch_num=model.batch_num, + batch_dt=f"{model.batch_dt * 1000:.2f}", + batch_loss=f"{model.loss:.4f}", + lr=f"{model.learning_rate:.2e}", + ) + + +model.set_callback("on_batch_end", progress_callback) +model.train_model(train_dataset, batches=flags.n_batches) +model.save("gpt_checkpoint.pt") + +# --- evaluation --- + +# generation is faster on cpu +model.device = "cpu" + +# optionally evaluate on a smaller subset of the test data +# test_dataset = test_dataset.sample(n=10000) +autocompleter = AutoCompleter(model, encoder, target_field=TARGET_FIELD, max_batch_size=5000, show_progress=False) +completions = autocompleter.autocomplete(test_dataset, decode=True) + +df = test_dataset.df +df["generated"] = completions +df["pathology"] = np.array(pathologies_test) +df["predicted"] = [c[TARGET_FIELD] for c in completions] + + +def get_ddx_arr(ddx_arr): + if not isinstance(ddx_arr, list): + # if model doesn't predict an array, this can happen + # we return an empty list, which will lead to prec = rec = 0 + return [] + + if TARGET_FIELD.endswith("_NOPROB"): + return ddx_arr + + # only return the diagnosis name, not the probability + return [a[0] for a in ddx_arr] + + +ddr = [] +ddp = [] +gtpa_at_1 = [] +gtpa = [] + +for i, row in df.iterrows(): + y_true = get_ddx_arr(row["target"]) + y_pred = get_ddx_arr(row["predicted"]) + + intersection = set(y_true).intersection(set(y_pred)) + ddr.append(len(intersection) / len(y_true)) + ddp.append(len(intersection) / len(y_pred) if len(y_pred) > 0 else 0) + + # is pathology the top diagnosis? 
+ gtpa_at_1.append(int(len(y_pred) > 0 and row["pathology"] == y_pred[0])) + + # is pathology one of the predicted diagnoses? + gtpa.append(int(row["pathology"] in y_pred)) + +ddr = np.mean(ddr) +ddp = np.mean(ddp) +f1 = 2 * ddr * ddp / (ddr + ddp) +gtpa_at_1 = np.mean(gtpa_at_1) +gtpa = np.mean(gtpa) + +print(f"\n Evaluation result for {flags.eval_data} dataset") +print_guild_scalars(ddr=ddr, ddp=ddp, f1=f1, gtpa_at_1=gtpa_at_1, gtpa=gtpa) diff --git a/experiments/ddxplus/utils.py b/experiments/ddxplus/utils.py new file mode 100644 index 0000000..d13c69f --- /dev/null +++ b/experiments/ddxplus/utils.py @@ -0,0 +1,56 @@ +from typing import Dict + +import numpy as np + + +def get_scores( + y_target: np.ndarray, y_pred: np.ndarray, y_pathology: np.ndarray, postfix: str = "" +) -> Dict[str, float]: + ddr = [] # ddx recall + ddp = [] # ddx precision + gtpa = [] # ground truth pathology accuracy + gtpa_at_1 = [] + + for y_target_i, y_pred_i, y_pathology_i in zip(y_target, y_pred, y_pathology): + y_pred_i_ix = set(np.where(y_pred_i > 0.5)[0]) + y_target_i_ix = set(np.where(y_target_i > 0.5)[0]) + + # precision and recall + intersection = y_pred_i_ix.intersection(y_target_i_ix) + + ddr.append(len(intersection) / len(y_target_i_ix)) + if len(y_pred_i_ix) > 0: + ddp.append(len(intersection) / len(y_pred_i_ix)) + else: + ddp.append(0) + + # gtpa + if y_pathology_i in y_pred_i_ix: + gtpa.append(1) + else: + gtpa.append(0) + + # gtpa @ 1 + first_pathology_predicted = y_pred_i.argmax() + if y_pathology_i == first_pathology_predicted: + gtpa_at_1.append(1) + else: + gtpa_at_1.append(0) + + recall = np.mean(ddr) + precision = np.mean(ddp) + if recall + precision <= 1e-6: + f1 = 0 + else: + f1 = 2 * recall * precision / (recall + precision) + + gtpa = np.mean(gtpa) + gtpa_at_1 = np.mean(gtpa_at_1) + + return { + f"recall{postfix}": recall, + f"precision{postfix}": precision, + f"f1{postfix}": f1, + f"gtpa{postfix}": gtpa, + f"gtpa_at_1{postfix}": gtpa_at_1, + } From
a2afd6157cb05ea4f0ff2e73226ea3aec704a295 Mon Sep 17 00:00:00 2001 From: Thomas Rueckstiess Date: Mon, 3 Feb 2025 18:00:01 +1100 Subject: [PATCH 3/9] added DDXPlus experiments --- experiments/ddxplus/README.md | 62 ++- experiments/ddxplus/baseline.ipynb | 669 +++++++++++++++++++++++++--- experiments/ddxplus/baseline.py | 57 --- experiments/ddxplus/guild.yml | 90 +--- experiments/ddxplus/run_baseline.py | 218 +++++++++ experiments/ddxplus/run_origami.py | 22 +- experiments/json2vec/README.md | 2 +- requirements.txt | 6 +- 8 files changed, 891 insertions(+), 235 deletions(-) delete mode 100644 experiments/ddxplus/baseline.py create mode 100644 experiments/ddxplus/run_baseline.py diff --git a/experiments/ddxplus/README.md b/experiments/ddxplus/README.md index 1504062..a140177 100644 --- a/experiments/ddxplus/README.md +++ b/experiments/ddxplus/README.md @@ -2,7 +2,7 @@ In this experiment we train the model on the [DDXPlus dataset](https://arxiv.org/abs/2205.09148), a dataset for automated medical diagnosis. We devise a task to predict the most likely differential diagnoses for each instance, a multi-label prediction task. -For ORIGAMI, we reformat the dataset into JSON format with two different representations: +For ORiGAMi, we reformat the dataset into JSON format with two different representations: - A flat representation, in which we store the evidences and their values as strings. - An object representation, where the evidences are stored as object containing array values. @@ -15,61 +15,51 @@ First, make sure you have restored the datasets from the mongo dump file as desc ## ORiGAMi -## Baselines - -### Single Run +We train a model with the `medium` size preset by default: 6 layers, 6 heads, 192 embedding dimensionality. To train with other model sizes, append `model_size=` to the command, using one of the following options: `xs`, `small`, `medium`, `large`, `xl`. 
-Running a single model, using `guild.yml` settings, as: +To train and evaluate ORiGAMi on the flat evidences structure, run the following: ```bash -guild run lr:hyperopt lr_C=10.0 +guild run origami:train evidences=flat eval_data=test seed="[1, 2, 3, 4, 5]" ``` -**Note**: parameters passed through the CLI overwrite values in the notebook and in `guild.yml`. - -It is also possible to run the notebook directly, though this will create a number of additional quantities tracked -by guild (e.g. `TARGET_FIELD`) which we may not be interested in, and as such, this method is for quick local checks only. +For the object representation of evidences, run instead: ```bash -guild run baseline.ipynb model_name=LogisticRegression lr_C=0.1 +guild run origami:train evidences=object eval_data=test seed="[1, 2, 3, 4, 5]" ``` -### Experimental Runs +This will repeat the training and evaluation 5 times with different random seeds and evaluate on the test set. + +## Baselines + +### Hyperparameter optimization -First perform HPO, supplying `limit=0` and the appropriate number of `--max-trials`: +First perform HPO, supplying the `` as one of `lr` (Logistic Regression), `rf` (Random Forest), `xgb` (XGBoost), `lgb` (LightGBM) and the appropriate number of trial runs with `--max-trials `, and give the run a name with `