mongodb-labs · rueckstiess · Feb 4, 2025 · Feb 3, 2025 · Feb 3, 2025 · Feb 3, 2025
diff --git a/experiments/README.md b/experiments/README.md
@@ -0,0 +1,27 @@
+# Reproducing the results from our paper
+
+This directory contains the code and instructions to reproduce the experiments from our paper:
+[ORIGAMI: A generative transformer architecture for predictions from semi-structured data](https://arxiv.org/abs/2412.17348).
+
+There are 3 sub-directories, each with their own `README.md` file:
+
+- [`json2vec`](./json2vec/README.md) contains the experiments from section 3.1, where we compare ORiGAMi against various baselines and the json2vec models from [A Framework for End-to-End Learning on Semantic Tree-Structured Data](https://arxiv.org/abs/2002.05707) by William Woof and Ke Chen, using standard tabular benchmarks that have been converted to JSON.
+- [`ddxplus`](./ddxplus/README.md) contains the experiments from section 3.2 for a medical diagnosis task on patient information. This experiment demonstrates prediction of multi-token values representing arrays of possible pathologies.
+- [`codenet`](./codenet/README.md) contains the experiments from section 3.3 related to a Java code classification task. Here we demonstrate the model's ability to deal with complex and deeply nested JSON objects.
+
+### Experiment Tracking
+
+We use the open source library [guild.ai](https://guild.ai) for experiment management and result tracking.
+
+### Datasets
+
+We bundled all datasets used in the paper in a [MongoDB dump file](). To reproduce the results, you first
+need a MongoDB instance, either installed on your system or running on a remote server. Then download the dump file, unzip it, and restore it into your MongoDB instance:
+
+```bash
+mongorestore dump/
+```
+
+This assumes your `mongod` server is running on `localhost` on the default port 27017 without authentication. If your setup differs, consult the [documentation](https://www.mongodb.com/docs/database-tools/mongorestore/) for `mongorestore` on how to restore the data.
+
+If your database setup (URI, port, authentication) differs, also make sure to update the [`.env.local`](.env.local) file in each sub-directory accordingly.
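The `.env.local` files added in this PR hold plain `KEY=value` lines (`MONGO_URI`, `DATABASE`). As a rough illustration of how such a file could be read, here is a minimal sketch. `read_env_file` is a hypothetical helper written for this example; it is not the `load_secrets` function from `origami.utils.guild`, whose implementation is not shown in this diff.

```python
from pathlib import Path


def read_env_file(path):
    """Parse KEY=value lines into a dict (hypothetical helper, not part of origami)."""
    env = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines without an assignment
        # strip optional surrounding quotes, as in MONGO_URI="mongodb://..."
        env[key.strip()] = value.strip().strip("\"'")
    return env


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        env_path = Path(tmp) / ".env.local"
        env_path.write_text('MONGO_URI="mongodb://localhost:27017"\nDATABASE=ddxplus\n')
        print(read_env_file(env_path))
```

If your MongoDB instance needs credentials or a non-default port, only the `MONGO_URI` value in each sub-directory's `.env.local` would need to change under this reading.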
diff --git a/experiments/prediction/.env.local → experiments/codenet/.env.local b/experiments/prediction/.env.local → experiments/codenet/.env.local
diff --git a/experiments/codenet/README.md b/experiments/codenet/README.md
@@ -0,0 +1,20 @@
+# CodeNet Java Experiments
+
+In this experiment, we convert Java code snippets from the [CodeNet](https://developer.ibm.com/exchanges/data/all/project-codenet/) dataset into Abstract Syntax Trees and store them as JSON objects.
+We then train an ORiGAMi model on these ASTs for a classification task, where the programming problem ID is the target label. More details on the dataset and classification task can be found
+in the paper [CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks](https://arxiv.org/abs/2105.12655) by Ruchir Puri et al.
+
+First, make sure you have restored the datasets from the mongo dump file as described in [../README.md](../README.md). All commands (see below) must be run from the `codenet` directory.
+
+### Training and evaluating the model
+
+Due to resource constraints, we did not perform hyperparameter optimization. We use a model with 4 transformer layers, 4 attention heads and an embedding dimensionality of 192. All parameters are
+configured as defaults in the `guild.yml` file.
+
+To run the training and evaluation on the test set, use:
+
+```bash
+guild run train
+```
+
+Note: Training with the default parameters requires an estimated 50 GB of GPU memory.
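To give a feel for what "an AST stored as a JSON object" looks like, here is a toy analogue. It uses Python's built-in `ast` module on a Python snippet, not the Java parser or the document schema used for the CodeNet data, both of which are not shown in this diff. `node_to_doc` is a name made up for this sketch.

```python
import ast
import json


def node_to_doc(node):
    """Recursively turn an ast node into nested dicts and lists."""
    if isinstance(node, ast.AST):
        doc = {"type": type(node).__name__}
        for field, value in ast.iter_fields(node):
            doc[field] = node_to_doc(value)
        return doc
    if isinstance(node, list):
        return [node_to_doc(item) for item in node]
    return node  # leaf value such as a name, a number, or None


if __name__ == "__main__":
    tree = ast.parse("def add(a, b):\n    return a + b\n")
    # default=repr keeps the dump robust for leaves that JSON cannot encode
    print(json.dumps(node_to_doc(tree), indent=2, default=repr))
```

Even a two-line function yields several levels of nesting, which is the kind of deeply nested input this experiment targets.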
diff --git a/experiments/codenet/guild.yml b/experiments/codenet/guild.yml
@@ -0,0 +1,18 @@
+train:
+  description: Train a model on the codenet Java dataset
+  main: train
+  flags-dest: namespace:flags
+  flags:
+    n_batches: 200000
+    n_problems: 250
+    batch_size: 8
+    learning_rate: 1e-3
+    n_embd: 192
+    max_tokens: 4000
+    max_length: 4000
+    eval_every: 1000
+
+  # matches the output of the print_guild_scalars() helper function
+  output-scalars:
+    - step: '\|  step: (\step)'
+    - '\|  (\key): (\value)'
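The two `output-scalars` patterns above tell Guild how to pull a step and key/value pairs out of the training log. The sketch below mimics that capture with plain Python regular expressions. The `KEY`, `VALUE` and `STEP` expressions are simplified stand-ins for Guild's `\key`, `\value` and `\step` placeholders, and `format_scalars` is a hypothetical formatter; the real `print_guild_scalars` helper is not part of this diff and may format its output differently.

```python
import re

# Rough stand-ins for Guild's \key, \value and \step placeholders (assumed, simplified).
KEY = r"[^ \t]+"
VALUE = r"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?"
STEP = r"[0-9]+"

STEP_PATTERN = re.compile(rf"\|  step: ({STEP})")
SCALAR_PATTERN = re.compile(rf"\|  ({KEY}): ({VALUE})")


def format_scalars(**scalars):
    """Hypothetical formatter producing one '|  key: value' fragment per scalar."""
    return "".join(f"|  {key}: {value}  " for key, value in scalars.items())


def parse_scalars(line):
    """Extract the step and the remaining scalars from one output line."""
    step_match = STEP_PATTERN.search(line)
    step = int(step_match.group(1)) if step_match else None
    scalars = {k: float(v) for k, v in SCALAR_PATTERN.findall(line) if k != "step"}
    return step, scalars


if __name__ == "__main__":
    line = format_scalars(step="1000", batch_loss="0.1234", lr="1.00e-03")
    print(line)
    print(parse_scalars(line))
```

The point of the sketch is that the log format and the patterns have to agree character for character, including the two spaces after the pipe.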
diff --git a/experiments/codenet/train.py b/experiments/codenet/train.py
@@ -0,0 +1,201 @@
+from pathlib import Path
+from types import SimpleNamespace
+
+from pymongo import MongoClient
+from sklearn.pipeline import Pipeline
+
+from origami.inference import Predictor
+from origami.model import ORIGAMI
+from origami.model.vpda import ObjectVPDA
+from origami.preprocessing import (
+    DFDataset,
+    DocPermuterPipe,
+    DocTokenizerPipe,
+    PadTruncTokensPipe,
+    TargetFieldPipe,
+    TokenEncoderPipe,
+    UpscalerPipe,
+    load_df_from_mongodb,
+)
+from origami.utils.common import set_seed
+from origami.utils.config import GuardrailsMethod, ModelConfig, PositionEncodingMethod, TrainConfig
+from origami.utils.guild import load_secrets, print_guild_scalars
+
+# populated by guild
+flags = SimpleNamespace()
+secrets = load_secrets()
+
+# for reproducibility
+set_seed(1234)
+
+TARGET_FIELD = "problem"
+UPSCALE = 2
+
+client = MongoClient(secrets["MONGO_URI"])
+collection = client["codenet_java"].train
+
+target_problems = collection.distinct(TARGET_FIELD)
+num_problems = len(target_problems)
+
+target_problems = target_problems[: flags.n_problems]
+print(f"training on {len(target_problems)} problems (out of {num_problems})")
+
+# load data into dataframes for train/test, using the same URI as the client above
+
+train_docs_df = load_df_from_mongodb(
+    secrets["MONGO_URI"],
+    "codenet_java",
+    "train",
+    filter={"problem": {"$in": target_problems}},
+    projection={"_id": 0, "filePath": 0},
+)
+
+test_docs_df = load_df_from_mongodb(
+    secrets["MONGO_URI"],
+    "codenet_java",
+    "test",
+    filter={"problem": {"$in": target_problems}},
+    projection={"_id": 0, "filePath": 0},
+)
+
+num_train_inst = len(train_docs_df)
+num_test_inst = len(test_docs_df)
+
+# create train and test pipelines
+pipes = {
+    # --- train only ---
+    "upscaler": UpscalerPipe(n=UPSCALE),
+    "permuter": DocPermuterPipe(shuffle_arrays=True),
+    # --- train and test ---
+    "target": TargetFieldPipe(TARGET_FIELD),
+    "tokenizer": DocTokenizerPipe(path_in_field_tokens=False),
+    "padding": PadTruncTokensPipe(length=flags.max_length),
+    "encoder": TokenEncoderPipe(max_tokens=flags.max_tokens),
+}
+
+train_pipeline = Pipeline(
+    [(name, pipes[name]) for name in ("target", "upscaler", "permuter", "tokenizer", "padding", "encoder")],
+    verbose=True,
+)
+test_pipeline = Pipeline([(name, pipes[name]) for name in ("target", "tokenizer", "padding", "encoder")], verbose=True)
+
+# process train and test data (first fit both, then transform)
+train_pipeline.fit(train_docs_df)
+test_pipeline.fit(test_docs_df)
+
+train_df = train_pipeline.transform(train_docs_df)
+test_df = test_pipeline.transform(test_docs_df)
+
+# drop the docs column to save space
+train_df.drop(columns=["docs"], inplace=True)
+test_df.drop(columns=["docs"], inplace=True)
+
+# drop all rows where the tokens array doesn't end in 0 (longer than max_length)
+train_df = train_df[train_df["tokens"].apply(lambda x: x[-1] == 0)]
+test_df = test_df[test_df["tokens"].apply(lambda x: x[-1] == 0)]
+
+# get stateful objects
+encoder = pipes["encoder"].encoder
+block_size = pipes["padding"].length
+
+# print data stats
+print(
+    f"dropped {(1 - (len(train_df) / (UPSCALE * num_train_inst))) * 100:.2f}% training instances, and "
+    f"{(1 - (len(test_df) / num_test_inst)) * 100:.2f}% test instances."
+)
+print(f"vocab size {encoder.vocab_size}")
+print(f"block size {block_size}")
+
+# confirm that all targets are in the vocabulary
+for df in (train_df, test_df):
+    for target in df["target"].unique():
+        enc = encoder.encode(target)
+        assert target == encoder.decode(enc), f"token {target} not represented in vocab."
+
+# create datasets, VPDA and model
+
+# model and train configs
+model_config = ModelConfig.from_preset("small")
+model_config.position_encoding = PositionEncodingMethod.KEY_VALUE
+model_config.vocab_size = encoder.vocab_size
+model_config.block_size = block_size
+model_config.n_embd = flags.n_embd
+model_config.mask_field_token_losses = False
+model_config.tie_weights = False
+model_config.guardrails = GuardrailsMethod.STRUCTURE_ONLY
+model_config.fuse_pos_with_mlp = True
+
+train_config = TrainConfig()
+train_config.learning_rate = flags.learning_rate
+train_config.batch_size = flags.batch_size
+train_config.n_warmup_batches = 100
+train_config.eval_every = flags.eval_every
+
+# datasets
+train_dataset = DFDataset(train_df)
+test_dataset = DFDataset(test_df)
+
+vpda = ObjectVPDA(encoder)
+model = ORIGAMI(model_config, train_config, vpda=vpda)
+
+# load model checkpoint if it exists
+checkpoint_file = Path("./gpt-codenet-snapshot.pt")
+if checkpoint_file.is_file():
+    model.load("gpt-codenet-snapshot.pt")
+    print(f"loaded existing checkpoint at batch_num {model.batch_num}")
+
+
+# create a predictor
+predictor = Predictor(model, encoder, TARGET_FIELD)
+
+
+def progress_callback(model):
+    print_guild_scalars(
+        step=f"{int(model.batch_num)}",
+        epoch=model.epoch_num,
+        batch_num=model.batch_num,
+        batch_dt=f"{model.batch_dt * 1000:.2f}",
+        batch_loss=f"{model.loss:.4f}",
+        lr=f"{model.learning_rate:.2e}",
+    )
+    if model.batch_num % train_config.eval_every == 0:
+        try:
+            # train_acc = predictor.accuracy(train_dataset.sample(n=100))
+            test_acc = predictor.accuracy(test_dataset.sample(n=100), show_progress=True)
+            print_guild_scalars(
+                step=f"{int(model.batch_num)}",
+                # train_acc=f"{train_acc:.4f}",
+                test_acc=f"{test_acc:.4f}",
+            )
+            # print(f"Train accuracy @ 100: {train_acc:.4f}, Test accuracy @ 100: {test_acc:.4f}")
+        except AssertionError as e:
+            print(e)
+            print("continuing...")
+
+        model.save("gpt-codenet-snapshot.pt")
+        print("model saved to gpt-codenet-snapshot.pt")
+
+
+model.set_callback("on_batch_end", progress_callback)
+
+try:
+    model.train_model(train_dataset, batches=flags.n_batches)
+except KeyboardInterrupt:
+    pass
+
+# final save
+model.save("gpt-codenet-snapshot.pt")
+print("model saved to gpt-codenet-snapshot.pt")
+
+test_acc = predictor.accuracy(test_dataset, show_progress=True)
+print_guild_scalars(
+    step=f"{int(model.batch_num)}",
+    test_acc=f"{test_acc:.4f}",
+)
+
+dropped_ratio = 1 - (len(test_df) / num_test_inst)
+print(f"Final test accuracy when taking into account the dropped instances: {(1 - dropped_ratio) * test_acc:.4f}")
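The script drops every sequence whose last token is not the padding id and then scales the final accuracy by the share of test instances that were kept. The sketch below replays that logic on toy token lists. `pad_or_truncate` is a simplified stand-in for `PadTruncTokensPipe`, whose actual behaviour is not shown in this diff, and a padding id of 0 is assumed from the `x[-1] == 0` check.

```python
PAD_ID = 0  # assumed padding token id, matching the `x[-1] == 0` check in train.py


def pad_or_truncate(tokens, length, pad_id=PAD_ID):
    """Right-pad with pad_id or cut off at `length` (toy stand-in for PadTruncTokensPipe)."""
    return (tokens + [pad_id] * length)[:length]


def ends_in_padding(tokens, pad_id=PAD_ID):
    """True if the last position is padding, i.e. the sequence was not cut off."""
    return tokens[-1] == pad_id


def adjusted_accuracy(accuracy_on_kept, n_kept, n_total):
    """Count every dropped instance as a wrong prediction."""
    return accuracy_on_kept * n_kept / n_total


if __name__ == "__main__":
    sequences = [[5, 6, 7], [5, 6, 7, 8, 9, 10]]
    padded = [pad_or_truncate(seq, length=4) for seq in sequences]
    kept = [seq for seq in padded if ends_in_padding(seq)]
    print(padded, kept)
    print(adjusted_accuracy(0.9, n_kept=len(kept), n_total=len(padded)))
```

In this toy version a sequence that fills the block exactly has no trailing padding and is dropped as well, so the check is slightly conservative. `adjusted_accuracy` is algebraically the same as `(1 - dropped_ratio) * test_acc` in the script.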
diff --git a/experiments/ddxplus/.env.local b/experiments/ddxplus/.env.local
@@ -0,0 +1,2 @@
+MONGO_URI="mongodb://localhost:27017"
+DATABASE=ddxplus
diff --git a/experiments/ddxplus/README.md b/experiments/ddxplus/README.md
@@ -0,0 +1,65 @@
+# DDXPlus Experiments
+
+In this experiment we train the model on the [DDXPlus dataset](https://arxiv.org/abs/2205.09148), a dataset for automated medical diagnosis. We devise a task to predict the most likely differential diagnoses for each instance, a multi-label prediction task.
+
+For ORiGAMi, we reformat the dataset into JSON format with two different representations:
+
+- A flat representation, in which we store the evidences and their values as strings.
+- An object representation, where the evidences are stored as an object containing array values.
+
+We compare our model against four baselines: Logistic Regression, Random Forest, XGBoost and LightGBM. The baselines are trained on a
+flat representation by converting the evidence-value strings into a multi-label binary matrix. We wrap each model in a scikit-learn
+[MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html).
+
+First, make sure you have restored the datasets from the mongo dump file as described in [../README.md](../README.md). All commands (see below) must be run from the `ddxplus` directory.
+
+## ORiGAMi
+
+We train a model with the `medium` size preset by default: 6 layers, 6 heads and an embedding dimensionality of 192. To train with other model sizes, append `model_size=<size>` to the command, using one of the following options: `xs`, `small`, `medium`, `large`, `xl`.
+
+To train and evaluate ORiGAMi on the flat evidences structure, run the following:
+
+```bash
+guild run origami:train evidences=flat eval_data=test seed="[1, 2, 3, 4, 5]"
+```
+
+For the object representation of evidences, run instead:
+
+```bash
+guild run origami:train evidences=object eval_data=test seed="[1, 2, 3, 4, 5]"
+```
+
+This will repeat the training and evaluation 5 times with different random seeds and evaluate on the test set.
+
+## Baselines
+
+### Hyperparameter optimization
+
+First perform HPO. Supply `<model>` as one of `lr` (Logistic Regression), `rf` (Random Forest), `xgb` (XGBoost) or `lgb` (LightGBM), set the number of trial runs with `--max-trials <num>`, and name the run with `--label <label>`, e.g.
+
+```bash
+NUMPY_EXPERIMENTAL_DTYPE_API=1 guild run lr:hyperopt --optimizer random --max-trials 20 --label <label>
+```
+
+To find the best parameters on the validation dataset, use:
+
+```bash
+guild compare -Fl <label> -u
+```
+
+Sort the `f1_val_mean` column in descending order (press `S` key) and pick the run ID (first column) of the best configuration.
+
+Get the hyperparameters (= flags) with `guild runs info <run-id>`.
+
+### Evaluate best hyperparameters on test dataset
+
+Once the optimal hyperparameters are found, run the model with the optimal hyperparameters, e.g.:
+
+```bash
+guild run lr:train <param1>=<value1> <param2>=<value2> ...
+```
+
+Replace the `<param>` and `<value>` placeholders with the optimal hyperparameters. You can ignore `model_name` and `n_random_seeds` here.
+By default, the evaluation is done 5 times with different random seeds.
+
+The `<metric>_test_mean` and `<metric_test_val>` scores show the evaluation on the test dataset, where `<metric>` is one of `f1`, `precision`, `recall`.
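The baselines need a fixed-width input, which is why the evidence-value strings are turned into a multi-label binary matrix. The sketch below shows that transformation in plain Python. The evidence strings are made-up toy values, not actual DDXPlus codes, and the function names are hypothetical; the baseline code in this repository may well rely on a library encoder such as scikit-learn's `MultiLabelBinarizer` instead.

```python
def fit_vocabulary(rows):
    """Map each distinct evidence string to a column index, in sorted order."""
    tokens = sorted({token for row in rows for token in row})
    return {token: col for col, token in enumerate(tokens)}


def to_binary_matrix(rows, vocabulary):
    """One row per instance, one column per evidence string; 1 marks presence."""
    matrix = []
    for row in rows:
        vector = [0] * len(vocabulary)
        for token in row:
            col = vocabulary.get(token)
            if col is not None:  # ignore strings unseen during fitting
                vector[col] = 1
        matrix.append(vector)
    return matrix


if __name__ == "__main__":
    # toy, made-up evidence strings for illustration only
    train_rows = [["cough=yes", "fever=yes"], ["fever=yes", "pain=chest"]]
    vocabulary = fit_vocabulary(train_rows)
    print(vocabulary)
    print(to_binary_matrix(train_rows, vocabulary))
```

Fitting the vocabulary on the training rows only and ignoring unseen strings at test time keeps the column layout identical across splits.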
diff --git a/experiments/prediction/__init__.py → experiments/ddxplus/__init__.py b/experiments/prediction/__init__.py → experiments/ddxplus/__init__.py