diff --git a/experiments/README.md b/experiments/README.md
new file mode 100644
index 0000000..4646b19
--- /dev/null
+++ b/experiments/README.md
@@ -0,0 +1,27 @@
+# Reproducing the results from our paper
+
+This directory contains the code and instructions to reproduce the experiments from our paper:
+[ORIGAMI: A generative transformer architecture for predictions from semi-structured data](https://arxiv.org/abs/2412.17348).
+
+There are 3 sub-directories, each with its own `README.md` file:
+
+- [`json2vec`](./json2vec/README.md) contains the experiments from section 3.1, where we compare ORiGAMi against various baselines and the json2vec models from [A Framework for End-to-End Learning on Semantic Tree-Structured Data](https://arxiv.org/abs/2002.05707) by William Woof and Ke Chen on standard tabular benchmarks that have been converted to JSON.
+- [`ddxplus`](./ddxplus/README.md) contains the experiments from section 3.2 for a medical diagnosis task on patient information. This experiment demonstrates prediction of multi-token values representing arrays of possible pathologies.
+- [`codenet`](./codenet/README.md) contains the experiments from section 3.3 related to a Java code classification task. Here we demonstrate the model's ability to deal with complex and deeply nested JSON objects.
+
+### Experiment Tracking
+
+We use the open source library [guild.ai](https://guild.ai) for experiment management and result tracking.
+
+### Datasets
+
+We bundled all datasets used in the paper in a [MongoDB dump file](). To reproduce the results, you first
+need MongoDB installed on your system (or on a remote server). Then download the dump file, unzip it, and restore it into your MongoDB instance:
+
+```
+mongorestore dump/
+```
+
+This assumes your `mongod` server is running on `localhost` on the default port 27017 without authentication. If your setup differs, consult the [documentation](https://www.mongodb.com/docs/database-tools/mongorestore/) for `mongorestore` to see how to restore the data.
+
+If your database setup (URI, port, authentication) differs, also make sure to update the [`.env.local`](.env.local) file in each sub-directory accordingly.
diff --git a/experiments/prediction/.env.local b/experiments/codenet/.env.local
similarity index 100%
rename from experiments/prediction/.env.local
rename to experiments/codenet/.env.local
diff --git a/experiments/codenet/README.md b/experiments/codenet/README.md
new file mode 100644
index 0000000..305e8f2
--- /dev/null
+++ b/experiments/codenet/README.md
@@ -0,0 +1,20 @@
+# CodeNet Java Experiments
+
+In this experiment, we convert Java code snippets from the [CodeNet](https://developer.ibm.com/exchanges/data/all/project-codenet/) dataset into Abstract Syntax Trees and store them as JSON objects.
+We then train an ORiGAMi model on these ASTs for a classification task, where the programming problem ID is the target label. More details on the dataset and classification task can be found
+in the paper [CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks](https://arxiv.org/abs/2105.12655) by Ruchir Puri et al.
+
+First, make sure you have restored the datasets from the mongo dump file as described in [../README.md](../README.md). All commands (see below) must be run from the `codenet` directory.
+
+### Training and evaluating the model
+
+Due to resource constraints, we did not perform hyperparameter optimization. We use a model with 4 transformer layers, 4 heads, and an embedding dimensionality of 192. All parameters are
+configured as defaults in the `guild.yml` file.
+
+To run the training and evaluation on the test set, use:
+
+```bash
+guild run train
+```
+
+Note: Training with the default parameters requires an estimated 50 GB of GPU memory.
diff --git a/experiments/codenet/guild.yml b/experiments/codenet/guild.yml
new file mode 100644
index 0000000..3ad5fe3
--- /dev/null
+++ b/experiments/codenet/guild.yml
@@ -0,0 +1,18 @@
+train:
+  description: Train a model on the codenet Java dataset
+  main: train
+  flags-dest: namespace:flags
+  flags:
+    n_batches: 200000
+    n_problems: 250
+    batch_size: 8
+    learning_rate: 1e-3
+    n_embd: 192
+    max_tokens: 4000
+    max_length: 4000
+    eval_every: 1000
+
+  # matches the output of the print_guild_scalars() helper function
+  output-scalars:
+    - step: '\|  step: (\step)'
+    - '\|  (\key): (\value)'
diff --git a/experiments/codenet/train.py b/experiments/codenet/train.py
new file mode 100644
index 0000000..a40fe4d
--- /dev/null
+++ b/experiments/codenet/train.py
@@ -0,0 +1,201 @@
+from pathlib import Path
+from types import SimpleNamespace
+
+from pymongo import MongoClient
+from sklearn.pipeline import Pipeline
+
+from origami.inference import Predictor
+from origami.model import ORIGAMI
+from origami.model.vpda import ObjectVPDA
+from origami.preprocessing import (
+    DFDataset,
+    DocPermuterPipe,
+    DocTokenizerPipe,
+    PadTruncTokensPipe,
+    TargetFieldPipe,
+    TokenEncoderPipe,
+    UpscalerPipe,
+    load_df_from_mongodb,
+)
+from origami.utils.common import set_seed
+from origami.utils.config import GuardrailsMethod, ModelConfig, PositionEncodingMethod, TrainConfig
+from origami.utils.guild import load_secrets, print_guild_scalars
+
+# populated by guild
+flags = SimpleNamespace()
+secrets = load_secrets()
+
+# for reproducibility
+set_seed(1234)
+
+TARGET_FIELD = "problem"
+UPSCALE = 2
+
+client = MongoClient(secrets["MONGO_URI"])
+collection = client["codenet_java"].train
+
+target_problems = collection.distinct(TARGET_FIELD)
+num_problems = len(target_problems)
+
+target_problems = target_problems[: flags.n_problems]
+print(f"training on {flags.n_problems} problems (out of {num_problems})")
+
+# load data into dataframe for train/test
+
+train_docs_df = load_df_from_mongodb(
+    secrets["MONGO_URI"],  # use the URI from .env.local instead of a hardcoded host
+    "codenet_java",
+    "train",
+    filter={"problem": {"$in": target_problems}},
+    projection={"_id": 0, "filePath": 0},
+)
+
+test_docs_df = load_df_from_mongodb(
+    secrets["MONGO_URI"],  # use the URI from .env.local instead of a hardcoded host
+    "codenet_java",
+    "test",
+    filter={"problem": {"$in": target_problems}},
+    projection={"_id": 0, "filePath": 0},
+)
+
+num_train_inst = len(train_docs_df)
+num_test_inst = len(test_docs_df)
+
+# create train and test pipelines
+pipes = {
+    # --- train only ---
+    "upscaler": UpscalerPipe(n=UPSCALE),
+    "permuter": DocPermuterPipe(shuffle_arrays=True),
+    # --- test only ---
+    "target": TargetFieldPipe(TARGET_FIELD),
+    # --- train and test ---
+    "tokenizer": DocTokenizerPipe(path_in_field_tokens=False),
+    "padding": PadTruncTokensPipe(length=flags.max_length),
+    "encoder": TokenEncoderPipe(max_tokens=flags.max_tokens),
+}
+
+train_pipeline = Pipeline(
+    [(name, pipes[name]) for name in ("target", "upscaler", "permuter", "tokenizer", "padding", "encoder")],
+    verbose=True,
+)
+test_pipeline = Pipeline([(name, pipes[name]) for name in ("target", "tokenizer", "padding", "encoder")], verbose=True)
+
+# process train and test data (first fit both, then transform)
+train_pipeline.fit(train_docs_df)
+test_pipeline.fit(test_docs_df)
+
+train_df = train_pipeline.transform(train_docs_df)
+test_df = test_pipeline.transform(test_docs_df)
+
+# drop the docs column to save space
+train_df.drop(columns=["docs"], inplace=True)
+test_df.drop(columns=["docs"], inplace=True)
+
+# drop all rows where the tokens array doesn't end in 0 (longer than max_length)
+train_df = train_df[train_df["tokens"].apply(lambda x: x[-1] == 0)]
+test_df = test_df[test_df["tokens"].apply(lambda x: x[-1] == 0)]
+
+# get stateful objects
+encoder = pipes["encoder"].encoder
+block_size = pipes["padding"].length
+
+# print data stats
+print(
+    f"dropped {(1 - (len(train_df) / (UPSCALE * num_train_inst))) * 100:.2f}% training instances, and "
+    f"{(1 - (len(test_df) / num_test_inst)) * 100:.2f}% test instances."
+)
+print(f"vocab size {encoder.vocab_size}")
+print(f"block size {block_size}")
+
+# confirm that all targets are in the vocabulary
+for target in train_df["target"].unique():
+    enc = encoder.encode(target)
+    assert target == encoder.decode(enc), f"token {target} not represented in vocab."
+
+for target in test_df["target"].unique():
+    enc = encoder.encode(target)
+    assert target == encoder.decode(enc), f"token {target} not represented in vocab."
+
+# create datasets, VPDA and model
+
+# model and train configs
+model_config = ModelConfig.from_preset("small")
+model_config.position_encoding = PositionEncodingMethod.KEY_VALUE
+model_config.vocab_size = encoder.vocab_size
+model_config.block_size = block_size
+model_config.n_embd = flags.n_embd
+model_config.mask_field_token_losses = False
+model_config.tie_weights = False
+model_config.guardrails = GuardrailsMethod.STRUCTURE_ONLY
+model_config.fuse_pos_with_mlp = True
+
+train_config = TrainConfig()
+train_config.learning_rate = flags.learning_rate
+train_config.batch_size = flags.batch_size
+train_config.n_warmup_batches = 100
+train_config.eval_every = flags.eval_every
+
+# datasets
+train_dataset = DFDataset(train_df)
+test_dataset = DFDataset(test_df)
+
+vpda = ObjectVPDA(encoder)
+model = ORIGAMI(model_config, train_config, vpda=vpda)
+
+# load model checkpoint if it exists
+checkpoint_file = Path("./gpt-codenet-snapshot.pt")
+if checkpoint_file.is_file():
+    model.load("gpt-codenet-snapshot.pt")
+    print(f"loaded existing checkpoint at batch_num {model.batch_num}...")
+
+
+# create a predictor
+predictor = Predictor(model, encoder, TARGET_FIELD)
+
+
+def progress_callback(model):
+    print_guild_scalars(
+        step=f"{int(model.batch_num)}",
+        epoch=model.epoch_num,
+        batch_num=model.batch_num,
+        batch_dt=f"{model.batch_dt * 1000:.2f}",
+        batch_loss=f"{model.loss:.4f}",
+        lr=f"{model.learning_rate:.2e}",
+    )
+    if model.batch_num % train_config.eval_every == 0:
+        try:
+            # train_acc = predictor.accuracy(train_dataset.sample(n=100))
+            test_acc = predictor.accuracy(test_dataset.sample(n=100), show_progress=True)
+            print_guild_scalars(
+                step=f"{int(model.batch_num)}",
+                # train_acc=f"{train_acc:.4f}",
+                test_acc=f"{test_acc:.4f}",
+            )
+            # print(f"Train accuracy @ 100: {train_acc:.4f}, Test accuracy @ 100: {test_acc:.4f}")
+        except AssertionError as e:
+            print(e)
+            print("continuing...")
+
+        model.save("gpt-codenet-snapshot.pt")
+        print("model saved to gpt-codenet-snapshot.pt")
+
+
+model.set_callback("on_batch_end", progress_callback)
+
+try:
+    model.train_model(train_dataset, batches=flags.n_batches)
+except KeyboardInterrupt:
+    pass
+
+# final save
+model.save("gpt-codenet-snapshot.pt")
+print("model saved to gpt-codenet-snapshot.pt")
+
+test_acc = predictor.accuracy(test_dataset, show_progress=True)
+print_guild_scalars(
+    step=f"{int(model.batch_num)}",
+    test_acc=f"{test_acc:.4f}",
+)
+
+dropped_ratio = 1 - (len(test_df) / num_test_inst)
+print(f"Final test accuracy when taking into account the dropped instances: {(1 - dropped_ratio) * test_acc:.4f}")
diff --git a/experiments/ddxplus/.env.local b/experiments/ddxplus/.env.local
new file mode 100644
index 0000000..9d63cbd
--- /dev/null
+++ b/experiments/ddxplus/.env.local
@@ -0,0 +1,2 @@
+MONGO_URI="mongodb://localhost:27017"
+DATABASE=ddxplus
\ No newline at end of file
diff --git a/experiments/ddxplus/README.md b/experiments/ddxplus/README.md
new file mode 100644
index 0000000..6f7c982
--- /dev/null
+++ b/experiments/ddxplus/README.md
@@ -0,0 +1,65 @@
+# DDXPlus Experiments
+
+In this experiment we train the model on the [DDXPlus dataset](https://arxiv.org/abs/2205.09148), a dataset for automated medical diagnosis. We devise a task to predict the most likely differential diagnoses for each instance, a multi-label prediction task.
+
+For ORiGAMi, we reformat the dataset into JSON format with two different representations:
+
+- A flat representation, in which we store the evidences and their values as strings.
+- An object representation, where the evidences are stored as objects containing array values.
+
+We compare our model against baselines: Logistic Regression, Random Forests, XGBoost, LightGBM. The baselines are trained on a
+flat representation by converting the evidence-value strings into a multi-label binary matrix. We wrap each model in a scikit-learn
+[MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html).
+
+First, make sure you have restored the datasets from the mongo dump file as described in [../README.md](../README.md). All commands (see below) must be run from the `ddxplus` directory.
+
+## ORiGAMi
+
+We train a model with the `medium` size preset by default: 6 layers, 6 heads, and an embedding dimensionality of 192. To train with other model sizes, append `model_size=<size>` to the command, using one of the following options: `xs`, `small`, `medium`, `large`, `xl`.
+
+To train and evaluate ORiGAMi on the flat evidences structure, run the following:
+
+```bash
+guild run origami:train evidences=flat eval_data=test seed="[1, 2, 3, 4, 5]"
+```
+
+For the object representation of evidences, run instead:
+
+```bash
+guild run origami:train evidences=object eval_data=test seed="[1, 2, 3, 4, 5]"
+```
+
+This will repeat the training and evaluation 5 times with different random seeds and evaluate on the test set.
+
+## Baselines
+
+### Hyperparameter optimization
+
+First perform HPO. Choose `<model>` from `lr` (Logistic Regression), `rf` (Random Forest), `xgb` (XGBoost), or `lgb` (LightGBM), set the number of trial runs with `--max-trials <num>`, and name the run with `--label <label>`, e.g.
+
+```bash
+NUMPY_EXPERIMENTAL_DTYPE_API=1 guild run lr:hyperopt --optimizer random --max-trials 20 --label <label>
+```
+
+To find the best parameters on the validation dataset, use:
+
+```bash
+guild compare -Fl <label> -u
+```
+
+Sort the `f1_val_mean` column in descending order (press `S` key) and pick the run ID (first column) of the best configuration.
+
+Get the hyperparameters (= flags) with `guild runs info <run-id>`.
+
+### Evaluate best hyperparameters on test dataset
+
+Once the optimal hyperparameters are found, run the model with the optimal hyperparameters, e.g.:
+
+```bash
+guild run lr:train <param1>=<value1> <param2>=<value2> ...
+```
+
+Replace the `<param>` and `<value>` placeholders with the optimal hyperparameters. You can ignore `model_name` and `n_random_seeds` here.
+By default, the evaluation is done 5 times with different random seeds.
+
+The `<metric>_test_mean` and `<metric>_test_std` scores show the evaluation on the test dataset, where `<metric>` is one of `f1`, `precision`, `recall`.
diff --git a/experiments/prediction/__init__.py b/experiments/ddxplus/__init__.py
similarity index 100%
rename from experiments/prediction/__init__.py
rename to experiments/ddxplus/__init__.py
diff --git a/experiments/ddxplus/guild.yml b/experiments/ddxplus/guild.yml
new file mode 100644
index 0000000..c264ede
--- /dev/null
+++ b/experiments/ddxplus/guild.yml
@@ -0,0 +1,368 @@
+- model: origami
+  operations:
+    train:
+      main: run_origami
+      flags-dest: namespace:flags
+      flags:
+        model_size:
+          default: medium
+          choices: [xs, small, medium, large, xl]
+        seed: 1234
+        n_batches: 33000
+        eval_data:
+          default: validate
+          choices: [validate, test]
+        evidences:
+          default: flat
+          choices: [flat, object]
+        limit: 0
+        verbose: False
+
+      requires:
+        - file: .env.local
+        - file: .env.remote
+
+      # matches the output of the print_guild_scalars() helper function
+      output-scalars:
+        - step: '\|  step: (\step)'
+        - '\|  (\key): (\value)'
+
+- config: shared-flags
+  flags:
+    limit:
+      default: 0
+      type: int
+
+- model: lr
+  operations:
+    hyperopt:
+      description: "hyper-parameter tuning of LogisticRegression baseline"
+      main: run_baseline
+      flags:
+        $include: shared-flags
+        model_name: LogisticRegression
+        n_random_seeds:
+          default: 1
+        lr_penalty:
+          choices: ["l1", "l2", "none"]
+        lr_max_iter:
+          choices: [10, 50, 100, 300, 500, 1000, 5000]
+          type: int
+        lr_fit_intercept:
+          choices: [True, False]
+        lr_C:
+          choices: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]
+          type: float
+
+      output-scalars:
+        - step: '\|  step: (\step)'
+        - '\|  (\key): (\value)'
+
+    train:
+      description: "run LogisticRegression model with optimal hyperparams"
+      main: run_baseline
+      flags:
+        $include: shared-flags
+        model_name: LogisticRegression
+        n_random_seeds:
+          default: 5
+        lr_penalty:
+          default: "CHANGE HERE"
+        lr_max_iter:
+          default: 0
+          type: int
+        lr_fit_intercept:
+          default: "CHANGE HERE"
+        lr_C:
+          default: 0
+          type: float
+
+      output-scalars:
+        - step: '\|  step: (\step)'
+        - '\|  (\key): (\value)'
+
+- model: rf
+  operations:
+    hyperopt:
+      description: "hyper-parameter tuning of RandomForest baseline"
+      main: run_baseline
+      flags:
+        $include: shared-flags
+        model_name: RandomForest
+        n_random_seeds:
+          default: 1
+        rf_n_estimators:
+          choices: [20, 50, 100, 150, 200]
+          type: int
+        rf_max_features:
+          choices: ["log2", "sqrt", "none"]
+        rf_max_depth:
+          choices: [0, 1, 5, 10, 20, 30, 45, "none"]
+        rf_min_samples_split:
+          choices: [5, 10]
+          type: int
+
+      output-scalars:
+        - step: '\|  step: (\step)'
+        - '\|  (\key): (\value)'
+
+    train:
+      description: "run RandomForest model with optimal hyperparams"
+      main: run_baseline
+      flags:
+        $include: shared-flags
+        model_name: RandomForest
+        n_random_seeds:
+          default: 5
+        rf_n_estimators:
+          default: 0
+          type: int
+        rf_max_features:
+          default: "CHANGE HERE"
+        rf_max_depth:
+          default: 0
+        rf_min_samples_split:
+          default: 0
+
+      output-scalars:
+        - step: '\|  step: (\step)'
+        - '\|  (\key): (\value)'
+
+- model: xgb
+  operations:
+    hyperopt:
+      description: "hyper-parameter tuning of XGBoost baseline"
+      main: run_baseline
+      flags:
+        $include: shared-flags
+        model_name: XGBoost
+        n_random_seeds:
+          default: 1
+        xgb_learning_rate:
+          choices: [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
+          type: float
+        xgb_max_depth:
+          choices: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+          type: int
+        xgb_subsample:
+          choices: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
+          type: float
+        xgb_colsample_bytree:
+          choices: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
+          type: float
+        xgb_colsample_bylevel:
+          choices: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
+          type: float
+        xgb_min_child_weight:
+          choices:
+            [
+              1e-16,
+              1e-15,
+              1e-14,
+              1e-13,
+              1e-12,
+              1e-11,
+              1e-10,
+              1e-9,
+              1e-8,
+              1e-7,
+              1e-6,
+              1e-5,
+              1e-4,
+              1e-3,
+              1e-2,
+              1e-1,
+              1.0,
+              1e1,
+              1e2,
+              1e3,
+              1e4,
+              1e5,
+            ]
+          type: float
+        xgb_reg_alpha:
+          choices:
+            [
+              1e-16,
+              1e-15,
+              1e-14,
+              1e-13,
+              1e-12,
+              1e-11,
+              1e-10,
+              1e-9,
+              1e-8,
+              1e-7,
+              1e-6,
+              1e-5,
+              1e-4,
+              1e-3,
+              1e-2,
+              1e-1,
+              1.0,
+              1e1,
+              1e2,
+            ]
+          type: float
+        xgb_reg_lambda:
+          choices:
+            [
+              1e-16,
+              1e-15,
+              1e-14,
+              1e-13,
+              1e-12,
+              1e-11,
+              1e-10,
+              1e-9,
+              1e-8,
+              1e-7,
+              1e-6,
+              1e-5,
+              1e-4,
+              1e-3,
+              1e-2,
+              1e-1,
+              1.0,
+              1e1,
+              1e2,
+            ]
+          type: float
+        xgb_gamma:
+          choices:
+            [
+              1e-16,
+              1e-15,
+              1e-14,
+              1e-13,
+              1e-12,
+              1e-11,
+              1e-10,
+              1e-9,
+              1e-8,
+              1e-7,
+              1e-6,
+              1e-5,
+              1e-4,
+              1e-3,
+              1e-2,
+              1e-1,
+              1.0,
+              1e1,
+              1e2,
+            ]
+          type: float
+        xgb_n_estimators:
+          choices: [100, 200, 500, 1000, 1500, 2000, 3000, 4000, 5000]
+          type: int
+
+      output-scalars:
+        - step: '\|  step: (\step)'
+        - '\|  (\key): (\value)'
+
+    train:
+      description: "run XGBoost model with optimal hyperparams"
+      main: run_baseline
+      flags:
+        $include: shared-flags
+        model_name: XGBoost
+        n_random_seeds:
+          default: 5
+        xgb_learning_rate:
+          default: 0
+        xgb_max_depth:
+          default: 0
+        xgb_subsample:
+          default: 0
+        xgb_colsample_bytree:
+          default: 0
+        xgb_colsample_bylevel:
+          default: 0
+        xgb_min_child_weight:
+          default: 0
+        xgb_reg_alpha:
+          default: 0
+        xgb_reg_lambda:
+          default: 0
+        xgb_gamma:
+          default: 0
+        xgb_n_estimators:
+          default: 0
+
+      output-scalars:
+        - step: '\|  step: (\step)'
+        - '\|  (\key): (\value)'
+
+- model: lgb
+  operations:
+    hyperopt:
+      description: "hyper-parameter tuning of LightGBM baseline"
+      main: run_baseline
+      flags:
+        $include: shared-flags
+        model_name: LightGBM
+        n_random_seeds:
+          default: 1
+        lgb_num_leaves:
+          choices: [5, 10, 20, 30, 40, 50]
+          type: int
+        lgb_max_depth:
+          choices:
+            [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
+          type: int
+        lgb_learning_rate:
+          choices: [1e-3, 1e-2, 1e-1, 1.0]
+          type: float
+        lgb_n_estimators:
+          choices: [50, 100, 200, 500, 1000, 1500, 2000]
+          type: int
+        lgb_min_child_weight:
+          choices: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3, 1e4]
+          type: float
+        lgb_subsample:
+          choices: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
+          type: float
+        lgb_colsample_bytree:
+          choices: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
+          type: float
+        lgb_reg_alpha:
+          choices: [0, 1e-1, 1, 2, 5, 7, 10, 50, 100]
+        lgb_reg_lambda:
+          choices: [0, 1e-1, 1, 2, 5, 7, 10, 50, 100]
+
+      output-scalars:
+        - step: '\|  step: (\step)'
+        - '\|  (\key): (\value)'
+
+    train:
+      description: "run LightGBM model with optimal hyperparams"
+      main: run_baseline
+      flags:
+        $include: shared-flags
+        model_name: LightGBM
+        n_random_seeds:
+          default: 5
+        lgb_num_leaves:
+          default: 0
+        lgb_max_depth:
+          default: 0
+          type: int
+        lgb_learning_rate:
+          default: 0
+        lgb_n_estimators:
+          default: 0
+        lgb_min_child_weight:
+          default: 0
+        lgb_subsample:
+          default: 0
+        lgb_colsample_bytree:
+          default: 0
+          type: float
+        lgb_reg_alpha:
+          default: 0
+        lgb_reg_lambda:
+          default: 0
+
+      output-scalars:
+        - step: '\|  step: (\step)'
+        - '\|  (\key): (\value)'
diff --git a/experiments/ddxplus/run_baseline.py b/experiments/ddxplus/run_baseline.py
new file mode 100644
index 0000000..c636906
--- /dev/null
+++ b/experiments/ddxplus/run_baseline.py
@@ -0,0 +1,218 @@
+import warnings
+from collections import defaultdict
+
+import numpy as np
+from lightgbm import LGBMClassifier
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.exceptions import ConvergenceWarning
+from sklearn.linear_model import LogisticRegression
+from sklearn.multioutput import MultiOutputClassifier
+from sklearn.preprocessing import MultiLabelBinarizer
+from utils import get_scores
+from xgboost import XGBClassifier
+
+from origami.preprocessing import load_df_from_mongodb
+from origami.utils.guild import load_secrets, print_guild_scalars
+
+# experiment flags
+model_name = "LogisticRegression"  # "XGBoost" # "RandomForest"
+limit = 1000
+n_random_seeds = 5
+
+print(f"Running {model_name=}, {limit=}, {n_random_seeds=}")
+
+# default model hyperparameters
+
+# logistic regression
+lr_C = 1.0
+lr_penalty = "none"
+lr_max_iter = 50
+lr_fit_intercept = True
+
+# xgboost
+xgb_learning_rate = 0.1
+xgb_max_depth = 5
+xgb_subsample = 1.0
+xgb_colsample_bytree = 1.0
+xgb_colsample_bylevel = 1.0
+xgb_min_child_weight = 1.0
+xgb_reg_alpha = 0.0
+xgb_reg_lambda = 1.0
+xgb_gamma = 0
+xgb_n_estimators = 100
+
+# random forest
+rf_n_estimators = 100
+rf_max_features = "none"
+rf_max_depth = "none"
+rf_min_samples_split = 5
+
+# lightgbm
+lgb_num_leaves = 10
+lgb_max_depth = 5
+lgb_learning_rate = 0.1
+lgb_n_estimators = 100
+lgb_min_child_weight = 1.0
+lgb_subsample = 0.8
+lgb_colsample_bytree = 0.8
+lgb_reg_alpha = 0.0
+lgb_reg_lambda = 1.0
+
+secrets = load_secrets()
+
+PROJECTION = {"_id": 0, "DIFFERENTIAL_DIAGNOSIS": 0}
+TARGET_FIELD = "DIFFERENTIAL_DIAGNOSIS_NOPROB"
+
+
+def load_docs(collection_name):
+    return load_df_from_mongodb(
+        uri=secrets["MONGO_URI"],
+        db=secrets["DATABASE"],
+        coll=collection_name,
+        projection=PROJECTION,
+        sort=[("_id", 1)],
+        limit=limit,
+    )
+
+
+def preprocess_dataset(df):
+    # pull up relevant fields at the top of the df
+    df["EVIDENCES"] = df["docs"].apply(lambda x: x["EVIDENCES"])
+    df["DIFFERENTIAL_DIAGNOSIS_NOPROB"] = df["docs"].apply(lambda x: x["DIFFERENTIAL_DIAGNOSIS_NOPROB"])
+    df["PATHOLOGY"] = df["docs"].apply(lambda x: x["PATHOLOGY"])
+    return df
+
+
+train_docs_df = load_docs(collection_name="train-noprob").pipe(preprocess_dataset)
+test_docs_df = load_docs(collection_name="test-noprob").pipe(preprocess_dataset)
+val_docs_df = load_docs(collection_name="validate-noprob").pipe(preprocess_dataset)
+
+
+def get_classifier(model_name, seed):
+    match model_name:
+        case "LogisticRegression":
+            clf = LogisticRegression(
+                random_state=seed,
+                C=lr_C if lr_penalty != "none" else 1.0,
+                penalty=lr_penalty if lr_penalty != "none" else None,
+                max_iter=lr_max_iter,
+                fit_intercept=True if lr_fit_intercept == 1 else False,
+                solver="saga",
+            )
+        case "XGBoost":
+            clf = XGBClassifier(
+                random_state=seed,
+                max_depth=xgb_max_depth,
+                learning_rate=xgb_learning_rate,
+                n_estimators=xgb_n_estimators,
+                subsample=xgb_subsample,
+                colsample_bytree=xgb_colsample_bytree,
+                colsample_bylevel=xgb_colsample_bylevel,
+                min_child_weight=xgb_min_child_weight,
+                reg_alpha=xgb_reg_alpha,
+                reg_lambda=xgb_reg_lambda,
+                gamma=xgb_gamma,
+            )
+        case "RandomForest":
+            clf = RandomForestClassifier(
+                random_state=seed,
+                n_estimators=rf_n_estimators,
+                max_features=rf_max_features if rf_max_features != "none" else None,
+                max_depth=rf_max_depth if rf_max_depth != "none" else None,
+                min_samples_split=rf_min_samples_split,
+            )
+        case "LightGBM":
+            clf = LGBMClassifier(
+                random_state=seed,
+                verbose=-1,
+                num_leaves=lgb_num_leaves,
+                max_depth=lgb_max_depth,
+                learning_rate=lgb_learning_rate,
+                n_estimators=lgb_n_estimators,
+                min_child_weight=lgb_min_child_weight,
+                subsample=lgb_subsample,
+                colsample_bytree=lgb_colsample_bytree,
+                reg_alpha=lgb_reg_alpha,
+                reg_lambda=lgb_reg_lambda,
+            )
+
+        case _:
+            raise ValueError(f"Unknown model {model_name}")
+    return clf
+
+
+# encode data
+mlb_ddx = MultiLabelBinarizer()
+mlb_evd = MultiLabelBinarizer()
+
+# train
+X_train = mlb_evd.fit_transform(train_docs_df["EVIDENCES"])
+y_train = mlb_ddx.fit_transform(train_docs_df["DIFFERENTIAL_DIAGNOSIS_NOPROB"])
+
+# val
+X_val = mlb_evd.transform(val_docs_df["EVIDENCES"])
+y_val = mlb_ddx.transform(val_docs_df["DIFFERENTIAL_DIAGNOSIS_NOPROB"])
+y_pathology_val = mlb_ddx.transform(
+    val_docs_df["PATHOLOGY"].apply(
+        lambda x: [
+            x,
+        ]
+    )
+)
+y_pathology_val = np.where(y_pathology_val > 0.5)[1]
+
+# test
+X_test = mlb_evd.transform(test_docs_df["EVIDENCES"])
+y_test = mlb_ddx.transform(test_docs_df["DIFFERENTIAL_DIAGNOSIS_NOPROB"])
+y_pathology_test = mlb_ddx.transform(
+    test_docs_df["PATHOLOGY"].apply(
+        lambda x: [
+            x,
+        ]
+    )
+)
+y_pathology_test = np.where(y_pathology_test > 0.5)[1]
+
+results = defaultdict(list)
+
+for clf_seed in range(n_random_seeds):
+    clf = get_classifier(model_name=model_name, seed=clf_seed)
+    multi_output_clf = MultiOutputClassifier(clf, n_jobs=4)
+    print(f"Training {clf}")
+
+    # train
+    with warnings.catch_warnings():
+        warnings.simplefilter(action="ignore", category=ConvergenceWarning)
+        multi_output_clf.fit(X_train, y_train)
+
+    # evaluate validation
+    y_pred_val = multi_output_clf.predict_proba(X_val)
+    y_pred_val = np.hstack([y_pred_val_i[:, 1].reshape(-1, 1) for y_pred_val_i in y_pred_val])
+
+    scores_val = get_scores(y_target=y_val, y_pred=y_pred_val, y_pathology=y_pathology_val, postfix="_val")
+    for score_name, score in scores_val.items():
+        results[score_name].append(score)
+
+    # evaluate test
+    y_pred_test = multi_output_clf.predict_proba(X_test)
+    y_pred_test = np.hstack([y_pred_test_i[:, 1].reshape(-1, 1) for y_pred_test_i in y_pred_test])
+
+    scores_test = get_scores(y_target=y_test, y_pred=y_pred_test, y_pathology=y_pathology_test, postfix="_test")
+    for score_name, score in scores_test.items():
+        results[score_name].append(score)
+
+    guild_output = {"step": clf_seed} | scores_val | scores_test
+    print_guild_scalars(**guild_output)
+
+print("\n\nAggregated metrics:")
+keys = list(results.keys())
+scalars = {}
+for key in keys:
+    scalars[f"{key}_mean"] = np.mean(results[key])
+    scalars[f"{key}_std"] = np.std(results[key])
+    scalars[f"{key}_min"] = np.min(results[key])
+    scalars[f"{key}_max"] = np.max(results[key])
+
+# print rounded scalars
+print_guild_scalars(**{k: f"{v:.4f}" for k, v in scalars.items()})
+print()
diff --git a/experiments/ddxplus/run_origami.py b/experiments/ddxplus/run_origami.py
new file mode 100644
index 0000000..9f3c5eb
--- /dev/null
+++ b/experiments/ddxplus/run_origami.py
@@ -0,0 +1,189 @@
+from types import SimpleNamespace
+
+import numpy as np
+from pymongo import MongoClient
+from sklearn.pipeline import Pipeline
+
+from origami.inference import AutoCompleter, Metrics
+from origami.model import ORIGAMI
+from origami.model.vpda import ObjectVPDA
+from origami.preprocessing import (
+    DFDataset,
+    DocPermuterPipe,
+    DocTokenizerPipe,
+    KBinsDiscretizerPipe,
+    PadTruncTokensPipe,
+    TargetFieldPipe,
+    TokenEncoderPipe,
+    UpscalerPipe,
+    load_df_from_mongodb,
+)
+from origami.utils.common import set_seed
+from origami.utils.config import ModelConfig, PositionEncodingMethod, TrainConfig
+from origami.utils.guild import load_secrets, print_guild_scalars
+
+flags = SimpleNamespace()
+
+secrets = load_secrets()
+
+set_seed(flags.seed)
+
+# load PATHOLOGY fields for test data
+client = MongoClient(secrets["MONGO_URI"])
+collection_test = client.ddxplus[f"{flags.eval_data}-semistructured"]
+
+pathologies_test = [
+    d["PATHOLOGY"] for d in collection_test.find({}, projection={"PATHOLOGY": 1}, limit=flags.limit, sort=[("_id", 1)])
+]
+
+# now load data for training and evaluation (test or validate), same sort order
+if flags.evidences == "flat":
+    PROJECTION = {"_id": 0, "PATHOLOGY": 0, "DIFFERENTIAL_DIAGNOSIS": 0, "EVIDENCES_JSON_V1": 0, "EVIDENCES_JSON_V2": 0}
+elif flags.evidences == "object":
+    PROJECTION = {"_id": 0, "PATHOLOGY": 0, "DIFFERENTIAL_DIAGNOSIS": 0, "EVIDENCES": 0, "EVIDENCES_JSON_V2": 0}
+TARGET_FIELD = "DIFFERENTIAL_DIAGNOSIS_NOPROB"
+
+train_docs_df = load_df_from_mongodb(
+    secrets["MONGO_URI"], "ddxplus", "train-semistructured", projection=PROJECTION, limit=flags.limit, sort=[("_id", 1)]
+)
+
+test_docs_df = load_df_from_mongodb(
+    secrets["MONGO_URI"],
+    "ddxplus",
+    f"{flags.eval_data}-semistructured",
+    projection=PROJECTION,
+    limit=flags.limit,
+    sort=[("_id", 1)],
+)
+
+# create train and test pipelines
+pipes = {
+    # --- train only ---
+    "upscaler": UpscalerPipe(n=2),
+    "permuter": DocPermuterPipe(),
+    # --- test only ---
+    "target": TargetFieldPipe(TARGET_FIELD),
+    # --- train and test ---
+    "discretizer": KBinsDiscretizerPipe(bins=128, threshold=128, strategy="kmeans"),
+    "tokenizer": DocTokenizerPipe(),
+    "padding": PadTruncTokensPipe(length="max"),
+    "encoder": TokenEncoderPipe(),
+}
+
+train_pipeline = Pipeline(
+    [(name, pipes[name]) for name in ("discretizer", "upscaler", "permuter", "tokenizer", "padding", "encoder")]
+)
+test_pipeline = Pipeline([(name, pipes[name]) for name in ("discretizer", "target", "tokenizer", "padding", "encoder")])
+
+# process train and test/validation data
+train_pipeline.fit(train_docs_df)
+test_pipeline.fit(test_docs_df)
+train_df = train_pipeline.transform(train_docs_df)
+test_df = test_pipeline.transform(test_docs_df)
+
+# get stateful objects
+encoder = pipes["encoder"].encoder
+block_size = pipes["padding"].length
+
+# print data stats
+print(f"len train: {len(train_df)}, len eval: {len(test_df)}")
+print(f"vocab size {encoder.vocab_size}")
+print(f"block size {block_size}")
+
+# wrap in datasets
+train_dataset = DFDataset(train_df)
+test_dataset = DFDataset(test_df)
+
+# model and train configs
+model_config = ModelConfig.from_preset(flags.model_size)
+model_config.position_encoding = PositionEncodingMethod.KEY_VALUE
+model_config.vocab_size = encoder.vocab_size
+model_config.block_size = block_size
+model_config.mask_field_token_losses = True
+
+train_config = TrainConfig()
+
+vpda = ObjectVPDA(encoder)  # build VPDA without schema (only doc structure enforced)
+model = ORIGAMI(model_config, train_config, vpda=vpda)
+
+metrics = Metrics(model)
+
+
+def progress_callback(model):
+    if model.batch_num % train_config.print_every == 0:
+        print_guild_scalars(
+            step=f"{int(model.batch_num / train_config.print_every)}",
+            epoch=model.epoch_num,
+            batch_num=model.batch_num,
+            batch_dt=f"{model.batch_dt * 1000:.2f}",
+            batch_loss=f"{model.loss:.4f}",
+            lr=f"{model.learning_rate:.2e}",
+        )
+
+
+model.set_callback("on_batch_end", progress_callback)
+model.train_model(train_dataset, batches=flags.n_batches)
+model.save("gpt_checkpoint.pt")
+
+# --- evaluation ---
+
+# generation is faster on cpu
+model.device = "cpu"
+
+# optionally evaluate on a smaller subset of the test data
+# test_dataset = test_dataset.sample(n=10000)
+autocompleter = AutoCompleter(model, encoder, target_field=TARGET_FIELD, max_batch_size=5000, show_progress=False)
+completions = autocompleter.autocomplete(test_dataset, decode=True)
+
+df = test_dataset.df
+df["generated"] = completions
+df["pathology"] = np.array(pathologies_test)
+df["predicted"] = [c[TARGET_FIELD] for c in completions]
+
+
+def get_ddx_arr(ddx_arr):
+    if not isinstance(ddx_arr, list):
+        # if model doesn't predict an array, this can happen
+        # we return an empty list, which will lead to prec = rec = 0
+        return []
+
+    if not TARGET_FIELD.endswith("_NOPROB"):
+        # entries are [diagnosis, probability] pairs: keep only the diagnosis name
+        ddx_arr = [a[0] if isinstance(a, list) and a else a for a in ddx_arr]
+
+    # if the array contains anything other than strings, return empty list
+    if not all(isinstance(x, str) for x in ddx_arr):
+        return []
+
+    return ddx_arr
+
+
+ddr = []
+ddp = []
+gtpa_at_1 = []
+gtpa = []
+
+for i, row in df.iterrows():
+    y_true = get_ddx_arr(row["target"])
+    y_pred = get_ddx_arr(row["predicted"])
+
+    print(f"{i: 4} - {y_true} {y_pred}")
+
+    intersection = set(y_true).intersection(set(y_pred))
+    ddr.append(len(intersection) / len(y_true) if len(y_true) > 0 else 0)
+    ddp.append(len(intersection) / len(y_pred) if len(y_pred) > 0 else 0)
+
+    # is pathology the top diagnosis?
+    gtpa_at_1.append(int(len(y_pred) > 0 and row["pathology"] == y_pred[0]))
+
+    # is pathology one of the predicted diagnoses?
+    gtpa.append(int(row["pathology"] in y_pred))
+
+ddr = np.mean(ddr)
+ddp = np.mean(ddp)
+f1 = 2 * ddr * ddp / (ddr + ddp) if (ddr + ddp) > 0 else 0.0
+gtpa_at_1 = np.mean(gtpa_at_1)
+gtpa = np.mean(gtpa)
+
+print(f"\n Evaluation result for {flags.eval_data} dataset")
+print_guild_scalars(ddr=ddr, ddp=ddp, f1=f1, gtpa_at_1=gtpa_at_1, gtpa=gtpa)
diff --git a/experiments/ddxplus/utils.py b/experiments/ddxplus/utils.py
new file mode 100644
index 0000000..d13c69f
--- /dev/null
+++ b/experiments/ddxplus/utils.py
@@ -0,0 +1,56 @@
+from typing import Dict
+
+import numpy as np
+
+
+def get_scores(
+    y_target: np.ndarray, y_pred: np.ndarray, y_pathology: np.ndarray, postfix: str = ""
+) -> Dict[str, float]:
+    ddr = []  # ddx recall
+    ddp = []  # ddx precision
+    gtpa = []  # ground truth pathology accuracy
+    gtpa_at_1 = []
+
+    for y_target_i, y_pred_i, y_pathology_i in zip(y_target, y_pred, y_pathology):
+        y_pred_i_ix = set(np.where(y_pred_i > 0.5)[0])
+        y_target_i_ix = set(np.where(y_target_i > 0.5)[0])
+
+        # precision and recall
+        intersection = y_pred_i_ix.intersection(y_target_i_ix)
+
+        ddr.append(len(intersection) / len(y_target_i_ix))
+        if len(y_pred_i_ix) > 0:
+            ddp.append(len(intersection) / len(y_pred_i_ix))
+        else:
+            ddp.append(0)
+
+        # gtpa
+        if y_pathology_i in y_pred_i_ix:
+            gtpa.append(1)
+        else:
+            gtpa.append(0)
+
+        # gtpa @ 1
+        first_pathology_predicted = y_pred_i.argmax()
+        if y_pathology_i == first_pathology_predicted:
+            gtpa_at_1.append(1)
+        else:
+            gtpa_at_1.append(0)
+
+    recall = np.mean(ddr)
+    precision = np.mean(ddp)
+    if recall + precision <= 1e-6:
+        f1 = 0
+    else:
+        f1 = 2 * recall * precision / (recall + precision)
+
+    gtpa = np.mean(gtpa)
+    gtpa_at_1 = np.mean(gtpa_at_1)
+
+    return {
+        f"recall{postfix}": recall,
+        f"precision{postfix}": precision,
+        f"f1{postfix}": f1,
+        f"gtpa{postfix}": gtpa,
+        f"gtpa_at_1{postfix}": gtpa_at_1,
+    }
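
Note on `get_scores` above: a pure-Python sketch that mirrors its logic on two toy samples, to make the metric definitions concrete. `toy_scores` and the input values are hypothetical and exist only for this illustration. Recall and precision are averaged per sample first, and F1 is then computed from those two means rather than averaged per sample.

```python
from typing import Dict, List, Sequence


def toy_scores(
    y_target: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[float]],
    y_pathology: Sequence[int],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Hypothetical stand-in for get_scores, using plain lists and sets."""
    recalls: List[float] = []
    precisions: List[float] = []
    gtpa: List[int] = []
    gtpa_at_1: List[int] = []

    for target_row, pred_row, pathology in zip(y_target, y_pred, y_pathology):
        target_ix = {i for i, v in enumerate(target_row) if v > threshold}
        pred_ix = {i for i, p in enumerate(pred_row) if p > threshold}
        hits = target_ix & pred_ix

        recalls.append(len(hits) / len(target_ix))  # assumes a non-empty target
        precisions.append(len(hits) / len(pred_ix) if pred_ix else 0.0)

        # GTPA: is the ground-truth pathology among the predicted labels?
        gtpa.append(int(pathology in pred_ix))

        # GTPA@1: is it the single highest-scoring label (first index on ties)?
        top = max(range(len(pred_row)), key=pred_row.__getitem__)
        gtpa_at_1.append(int(pathology == top))

    recall = sum(recalls) / len(recalls)
    precision = sum(precisions) / len(precisions)
    if recall + precision <= 1e-6:
        f1 = 0.0
    else:
        f1 = 2 * recall * precision / (recall + precision)

    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "gtpa": sum(gtpa) / len(gtpa),
        "gtpa_at_1": sum(gtpa_at_1) / len(gtpa_at_1),
    }


scores = toy_scores(
    y_target=[[1, 1, 0, 0], [0, 0, 1, 0]],
    y_pred=[[0.9, 0.2, 0.7, 0.1], [0.1, 0.3, 0.8, 0.6]],
    y_pathology=[0, 3],
)
print(scores)
```

In the second sample the pathology (index 3) clears the 0.5 threshold but is not the top-scoring label, so it counts towards GTPA but not GTPA@1.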
diff --git a/experiments/json2vec/.env.local b/experiments/json2vec/.env.local
new file mode 100644
index 0000000..83ce2d8
--- /dev/null
+++ b/experiments/json2vec/.env.local
@@ -0,0 +1,3 @@
+MONGO_USER=
+MONGO_PW=
+MONGO_URI="mongodb://localhost:27017"
diff --git a/experiments/prediction/README.md b/experiments/json2vec/README.md
similarity index 61%
rename from experiments/prediction/README.md
rename to experiments/json2vec/README.md
index 1f6445a..961f862 100644
--- a/experiments/prediction/README.md
+++ b/experiments/json2vec/README.md
@@ -1,73 +1,8 @@
-<!-- ## General Notes
+# json2vec Experiments
 
-### Cross-validation and train/test splits
+We compare our model against baselines (Logistic Regression, Random Forests, XGBoost, LightGBM) on the same benchmark datasets proposed in [A Framework for End-to-End Learning on Semantic Tree-Structured Data](https://arxiv.org/abs/2002.05707) by William Woof and Ke Chen. These datasets were originally taken from the UCI repository and have been converted from tabular form to JSON structure.
 
-The behaviour is controlled by the `cross_val` flag.
-
-- `cross_val=none` disables cross-validation and uses a simple train/test split
-- `cross_val=5-fold` creates 5 folds for cross-validation
-- `cross_val=catalog` uses the pre-defined split indices in the `openml.catalog` collection (only for OpenML datasets)
-
-Additional parameters:
-
-- `train.test_split` is the fraction of the test dataset when cross-validation is disabled
-- `train.shuffle_split` whether or not to shuffle rows (both for cross-validation splits and train/test splits)
-
-Some examples below:
-
-#### Single run with default train/test split
-
-Default test split is 0.2 and shuffled.
-
-```
-guild run <model>:all dataset=<dataset> cross_val=none
-```
-
-#### Single run with custom train/test split
-
-We choose a split of 60/40 and no shuffling.
-
-```
-guild run <model>:all dataset=<dataset> cross_val=none train.test_split=0.4 train.shuffle_split=no
-```
-
-#### 5-fold cross validation, unshuffled
-
-`train.test_split` is ignored.
-
-```
-guild run <model>:all dataset=<dataset> cross_val=5-fold train.shuffle_split=no
-```
-
-#### k-fold cross-validation from catalog
-
-This loads the split indices in the `openml.catalog` collection, which are stored
-under the field path `task.cross_validation`.
-
-`k` is usually 10, but may potentially differ, based on the splits defined in the `catalog` collection.
-
-`train.test_split` and `train.shuffle_split` are ignored.
-
-```
-guild run <model>:all dataset=tictactoe cross_val=catalog
-``` -->
-
-# Reproducing the results from our paper
-
-We use the open source library [guild.ai](https://guild.ai) for experiment management and result tracking.
-
-### Datasets
-
-We bundled all datasets used in the paper in a convenient [MongoDB dump file](). To reproduce the results, first
-you need MongoDB installed on your system (or a remote server). Then, download the dump file, unzip it, and restore it into your MongoDB instance:
-
-```
-mongorestore dump/
-```
-
-This assumes your `mongod` server is running on `localhost` on default port 27017 and without authentication. If your setup varies, consult the [documentation](https://www.mongodb.com/docs/database-tools/mongorestore/) for `mongorestore` on how to restore the data.
-
-If your database setup (URI, port, authentication) differs, also make sure to update the [`.env.local`](.env.local) file in this directory accordingly.
+First, make sure you have restored the datasets from the MongoDB dump file as described in [../README.md](../README.md). All commands below must be run from the `json2vec` directory.
 
 ### Hyper-parameter tuning
 
@@ -107,7 +42,7 @@ guild runs info <run-id>
 To run a particular parameter configuration on a dataset, use the following command:
 
 ```
-guild run <model>:all dataset=<dataset> <param1>=<value1> <param2=value>
+guild run <model>:all dataset=<dataset> <param1>=<value1> <param2>=<value2> ...
 ```
 
 - `<model>` is the model name, choose from `origami`, `logreg`, `rf`, `xgboost`, `lightgbm`.
diff --git a/experiments/json2vec/__init__.py b/experiments/json2vec/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/experiments/prediction/datasets/Internet-Advertisements.yml b/experiments/json2vec/datasets/Internet-Advertisements.yml
similarity index 100%
rename from experiments/prediction/datasets/Internet-Advertisements.yml
rename to experiments/json2vec/datasets/Internet-Advertisements.yml
diff --git a/experiments/prediction/datasets/adult.yml b/experiments/json2vec/datasets/adult.yml
similarity index 100%
rename from experiments/prediction/datasets/adult.yml
rename to experiments/json2vec/datasets/adult.yml
diff --git a/experiments/prediction/datasets/bank-marketing.yml b/experiments/json2vec/datasets/bank-marketing.yml
similarity index 100%
rename from experiments/prediction/datasets/bank-marketing.yml
rename to experiments/json2vec/datasets/bank-marketing.yml
diff --git a/experiments/prediction/datasets/car.yml b/experiments/json2vec/datasets/car.yml
similarity index 100%
rename from experiments/prediction/datasets/car.yml
rename to experiments/json2vec/datasets/car.yml
diff --git a/experiments/prediction/datasets/cmc.yml b/experiments/json2vec/datasets/cmc.yml
similarity index 100%
rename from experiments/prediction/datasets/cmc.yml
rename to experiments/json2vec/datasets/cmc.yml
diff --git a/experiments/prediction/datasets/connect-4.yml b/experiments/json2vec/datasets/connect-4.yml
similarity index 100%
rename from experiments/prediction/datasets/connect-4.yml
rename to experiments/json2vec/datasets/connect-4.yml
diff --git a/experiments/prediction/datasets/cylinder-bands.yml b/experiments/json2vec/datasets/cylinder-bands.yml
similarity index 100%
rename from experiments/prediction/datasets/cylinder-bands.yml
rename to experiments/json2vec/datasets/cylinder-bands.yml
diff --git a/experiments/prediction/datasets/ddxplus-json-v1.yml b/experiments/json2vec/datasets/ddxplus-json-v1.yml
similarity index 100%
rename from experiments/prediction/datasets/ddxplus-json-v1.yml
rename to experiments/json2vec/datasets/ddxplus-json-v1.yml
diff --git a/experiments/prediction/datasets/ddxplus-json-v2.yml b/experiments/json2vec/datasets/ddxplus-json-v2.yml
similarity index 100%
rename from experiments/prediction/datasets/ddxplus-json-v2.yml
rename to experiments/json2vec/datasets/ddxplus-json-v2.yml
diff --git a/experiments/prediction/datasets/ddxplus-raw.yml b/experiments/json2vec/datasets/ddxplus-raw.yml
similarity index 100%
rename from experiments/prediction/datasets/ddxplus-raw.yml
rename to experiments/json2vec/datasets/ddxplus-raw.yml
diff --git a/experiments/prediction/datasets/dna.yml b/experiments/json2vec/datasets/dna.yml
similarity index 100%
rename from experiments/prediction/datasets/dna.yml
rename to experiments/json2vec/datasets/dna.yml
diff --git a/experiments/prediction/datasets/dresses-sales.yml b/experiments/json2vec/datasets/dresses-sales.yml
similarity index 100%
rename from experiments/prediction/datasets/dresses-sales.yml
rename to experiments/json2vec/datasets/dresses-sales.yml
diff --git a/experiments/prediction/datasets/dungeons-mk.yml b/experiments/json2vec/datasets/dungeons-mk.yml
similarity index 100%
rename from experiments/prediction/datasets/dungeons-mk.yml
rename to experiments/json2vec/datasets/dungeons-mk.yml
diff --git a/experiments/prediction/datasets/dungeons-rkm.yml b/experiments/json2vec/datasets/dungeons-rkm.yml
similarity index 100%
rename from experiments/prediction/datasets/dungeons-rkm.yml
rename to experiments/json2vec/datasets/dungeons-rkm.yml
diff --git a/experiments/prediction/datasets/electricity.yml b/experiments/json2vec/datasets/electricity.yml
similarity index 100%
rename from experiments/prediction/datasets/electricity.yml
rename to experiments/json2vec/datasets/electricity.yml
diff --git a/experiments/prediction/datasets/json2vec-automobile.yml b/experiments/json2vec/datasets/json2vec-automobile.yml
similarity index 100%
rename from experiments/prediction/datasets/json2vec-automobile.yml
rename to experiments/json2vec/datasets/json2vec-automobile.yml
diff --git a/experiments/prediction/datasets/json2vec-bank.yml b/experiments/json2vec/datasets/json2vec-bank.yml
similarity index 100%
rename from experiments/prediction/datasets/json2vec-bank.yml
rename to experiments/json2vec/datasets/json2vec-bank.yml
diff --git a/experiments/prediction/datasets/json2vec-car.yml b/experiments/json2vec/datasets/json2vec-car.yml
similarity index 100%
rename from experiments/prediction/datasets/json2vec-car.yml
rename to experiments/json2vec/datasets/json2vec-car.yml
diff --git a/experiments/prediction/datasets/json2vec-contraceptive.yml b/experiments/json2vec/datasets/json2vec-contraceptive.yml
similarity index 100%
rename from experiments/prediction/datasets/json2vec-contraceptive.yml
rename to experiments/json2vec/datasets/json2vec-contraceptive.yml
diff --git a/experiments/prediction/datasets/json2vec-mushroom.yml b/experiments/json2vec/datasets/json2vec-mushroom.yml
similarity index 100%
rename from experiments/prediction/datasets/json2vec-mushroom.yml
rename to experiments/json2vec/datasets/json2vec-mushroom.yml
diff --git a/experiments/prediction/datasets/json2vec-nursery.yml b/experiments/json2vec/datasets/json2vec-nursery.yml
similarity index 100%
rename from experiments/prediction/datasets/json2vec-nursery.yml
rename to experiments/json2vec/datasets/json2vec-nursery.yml
diff --git a/experiments/prediction/datasets/json2vec-seismic.yml b/experiments/json2vec/datasets/json2vec-seismic.yml
similarity index 100%
rename from experiments/prediction/datasets/json2vec-seismic.yml
rename to experiments/json2vec/datasets/json2vec-seismic.yml
diff --git a/experiments/prediction/datasets/json2vec-student.yml b/experiments/json2vec/datasets/json2vec-student.yml
similarity index 100%
rename from experiments/prediction/datasets/json2vec-student.yml
rename to experiments/json2vec/datasets/json2vec-student.yml
diff --git a/experiments/prediction/datasets/jungle_chess_2pcs_raw_endgame_complete.yml b/experiments/json2vec/datasets/jungle_chess_2pcs_raw_endgame_complete.yml
similarity index 100%
rename from experiments/prediction/datasets/jungle_chess_2pcs_raw_endgame_complete.yml
rename to experiments/json2vec/datasets/jungle_chess_2pcs_raw_endgame_complete.yml
diff --git a/experiments/prediction/datasets/kr-vs-kp.yml b/experiments/json2vec/datasets/kr-vs-kp.yml
similarity index 100%
rename from experiments/prediction/datasets/kr-vs-kp.yml
rename to experiments/json2vec/datasets/kr-vs-kp.yml
diff --git a/experiments/prediction/datasets/letter.yml b/experiments/json2vec/datasets/letter.yml
similarity index 100%
rename from experiments/prediction/datasets/letter.yml
rename to experiments/json2vec/datasets/letter.yml
diff --git a/experiments/prediction/datasets/movies.yml b/experiments/json2vec/datasets/movies.yml
similarity index 100%
rename from experiments/prediction/datasets/movies.yml
rename to experiments/json2vec/datasets/movies.yml
diff --git a/experiments/prediction/datasets/mutagenesis.yml b/experiments/json2vec/datasets/mutagenesis.yml
similarity index 100%
rename from experiments/prediction/datasets/mutagenesis.yml
rename to experiments/json2vec/datasets/mutagenesis.yml
diff --git a/experiments/prediction/datasets/optdigits.yml b/experiments/json2vec/datasets/optdigits.yml
similarity index 100%
rename from experiments/prediction/datasets/optdigits.yml
rename to experiments/json2vec/datasets/optdigits.yml
diff --git a/experiments/prediction/datasets/phishing.yml b/experiments/json2vec/datasets/phishing.yml
similarity index 100%
rename from experiments/prediction/datasets/phishing.yml
rename to experiments/json2vec/datasets/phishing.yml
diff --git a/experiments/prediction/datasets/semeion.yml b/experiments/json2vec/datasets/semeion.yml
similarity index 100%
rename from experiments/prediction/datasets/semeion.yml
rename to experiments/json2vec/datasets/semeion.yml
diff --git a/experiments/prediction/datasets/sick.yml b/experiments/json2vec/datasets/sick.yml
similarity index 100%
rename from experiments/prediction/datasets/sick.yml
rename to experiments/json2vec/datasets/sick.yml
diff --git a/experiments/prediction/datasets/splice.yml b/experiments/json2vec/datasets/splice.yml
similarity index 100%
rename from experiments/prediction/datasets/splice.yml
rename to experiments/json2vec/datasets/splice.yml
diff --git a/experiments/prediction/datasets/tictactoe.yml b/experiments/json2vec/datasets/tictactoe.yml
similarity index 100%
rename from experiments/prediction/datasets/tictactoe.yml
rename to experiments/json2vec/datasets/tictactoe.yml
diff --git a/experiments/prediction/flags.yml b/experiments/json2vec/flags.yml
similarity index 100%
rename from experiments/prediction/flags.yml
rename to experiments/json2vec/flags.yml
diff --git a/experiments/prediction/guild.yml b/experiments/json2vec/guild.yml
similarity index 100%
rename from experiments/prediction/guild.yml
rename to experiments/json2vec/guild.yml
diff --git a/experiments/prediction/run_baseline.py b/experiments/json2vec/run_baseline.py
similarity index 100%
rename from experiments/prediction/run_baseline.py
rename to experiments/json2vec/run_baseline.py
diff --git a/experiments/prediction/run_lightgbm.py b/experiments/json2vec/run_lightgbm.py
similarity index 100%
rename from experiments/prediction/run_lightgbm.py
rename to experiments/json2vec/run_lightgbm.py
diff --git a/experiments/prediction/run_logreg.py b/experiments/json2vec/run_logreg.py
similarity index 100%
rename from experiments/prediction/run_logreg.py
rename to experiments/json2vec/run_logreg.py
diff --git a/experiments/prediction/run_origami.py b/experiments/json2vec/run_origami.py
similarity index 100%
rename from experiments/prediction/run_origami.py
rename to experiments/json2vec/run_origami.py
diff --git a/experiments/prediction/run_rf.py b/experiments/json2vec/run_rf.py
similarity index 100%
rename from experiments/prediction/run_rf.py
rename to experiments/json2vec/run_rf.py
diff --git a/experiments/prediction/run_xgboost.py b/experiments/json2vec/run_xgboost.py
similarity index 100%
rename from experiments/prediction/run_xgboost.py
rename to experiments/json2vec/run_xgboost.py
diff --git a/experiments/prediction/runner.py b/experiments/json2vec/runner.py
similarity index 100%
rename from experiments/prediction/runner.py
rename to experiments/json2vec/runner.py
diff --git a/experiments/prediction/utils.py b/experiments/json2vec/utils.py
similarity index 100%
rename from experiments/prediction/utils.py
rename to experiments/json2vec/utils.py
diff --git a/requirements.txt b/requirements.txt
index c30c386..3a5958e 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,6 +1,9 @@
 click==8.1.7
 click-option-group==0.5.6
 guildai==0.9.0
+jupyter==1.1.1
+jupyter_contrib_nbextensions==0.7.0
+lightgbm==4.5.0
 matplotlib==3.9.2
 mdbrtools==0.1.1
 numpy==1.26.4
@@ -12,4 +15,5 @@ pytest==8.3.3
 ruff==0.9.3
 scikit_learn==1.5.2
 torch==2.4.1
-tqdm==4.66.4
\ No newline at end of file
+tqdm==4.66.4
+xgboost==2.1.3
\ No newline at end of file