Merged
27 changes: 27 additions & 0 deletions experiments/README.md
@@ -0,0 +1,27 @@
# Reproducing the results from our paper

This directory contains the code and instructions to reproduce the experiments from our paper:
[ORIGAMI: A generative transformer architecture for predictions from semi-structured data](https://arxiv.org/abs/2412.17348).

There are 3 sub-directories, each with their own `README.md` file:

- [`json2vec`](./json2vec/README.md) contains the experiments from section 3.1, where we compare ORiGAMi against various baselines and the json2vec models from [A Framework for End-to-End Learning on Semantic Tree-Structured Data](https://arxiv.org/abs/2002.05707) by William Woof and Ke Chen, using standard tabular benchmarks that have been converted to JSON.
- [`ddxplus`](./ddxplus/README.md) contains the experiments from section 3.2 for a medical diagnosis task on patient information. This experiment demonstrates prediction of multi-token values representing arrays of possible pathologies.
- [`codenet`](./codenet/README.md) contains the experiments from section 3.3 related to a Java code classification task. Here we demonstrate the model's ability to deal with complex and deeply nested JSON objects.

### Experiment Tracking

We use the open source library [guild.ai](https://guild.ai) for experiment management and result tracking.

### Datasets

We bundled all datasets used in the paper in a [MongoDB dump file](). To reproduce the results, you first
need a MongoDB instance, either installed locally or running on a remote server. Then download the dump file, unzip it, and restore it into that instance:

```
mongorestore dump/
```

This assumes your `mongod` server is running on `localhost` on the default port 27017 and without authentication. If your setup differs, consult the [documentation](https://www.mongodb.com/docs/database-tools/mongorestore/) for `mongorestore` on how to restore the data.

If your database setup (URI, port, authentication) differs, also make sure to update the [`.env.local`](.env.local) file in each sub-directory accordingly.
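
As a rough illustration of the `.env.local` format, the sketch below parses `KEY=VALUE` lines into a dictionary. The `parse_env` function is a hypothetical helper written for this example only. The experiment scripts read their settings through the library's own `load_secrets` function, whose implementation is not shown here and may behave differently.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments.

    Hypothetical helper for illustration; not part of the origami package.
    """
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Same two settings as the default .env.local shipped with the ddxplus experiments.
example = 'MONGO_URI="mongodb://localhost:27017"\nDATABASE=ddxplus\n'
settings = parse_env(example)
print(settings["MONGO_URI"])
print(settings["DATABASE"])
```

When pointing the experiments at a remote or authenticated server, only the `MONGO_URI` value should need to change.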
File renamed without changes.
20 changes: 20 additions & 0 deletions experiments/codenet/README.md
@@ -0,0 +1,20 @@
# CodeNet Java Experiments

In this experiment, we convert Java code snippets from the [CodeNet](https://developer.ibm.com/exchanges/data/all/project-codenet/) dataset into Abstract Syntax Trees and store them as JSON objects.
We then train an ORiGAMi model on these ASTs for a classification task, where the programming problem ID is the target label. More details on the dataset and classification task can be found
in the paper [CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks](https://arxiv.org/abs/2105.12655) by Ruchir Puri et al.
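
To give a feel for what "an AST stored as a JSON object" looks like, here is a minimal sketch that uses Python's built-in `ast` module as a stand-in. This is not the Java conversion used for the experiments (that tooling is not shown in this README). The `node_to_dict` function and the `type` key are assumptions made for this illustration.

```python
import ast
import json


def node_to_dict(node):
    """Recursively convert an ast node into nested dicts and lists."""
    if isinstance(node, ast.AST):
        result = {"type": type(node).__name__}
        for name, value in ast.iter_fields(node):
            result[name] = node_to_dict(value)
        return result
    if isinstance(node, list):
        return [node_to_dict(item) for item in node]
    return node  # strings, numbers, None


source = "def add(a, b):\n    return a + b\n"
doc = node_to_dict(ast.parse(source))

# The nested dict serializes directly to JSON.
encoded = json.dumps(doc, indent=2)
print(encoded[:200])
```

Even this two-line function yields several levels of nesting, which is the property the classification task relies on.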

First, make sure you have restored the datasets from the mongo dump file as described in [../README.md](../README.md). All commands (see below) must be run from the `codenet` directory.

### Training and evaluating the model

Due to resource constraints, we did not perform hyperparameter optimization. We use a model with 4 transformer layers, 4 heads, and an embedding dimensionality of 192. All parameters are
configured as defaults in the `guild.yml` file.

To run the training and evaluation on the test set, use:

```bash
guild run train
```

Note: Training with the default parameters requires an estimated 50 GB of GPU memory.
18 changes: 18 additions & 0 deletions experiments/codenet/guild.yml
@@ -0,0 +1,18 @@
train:
  description: Train a model on the codenet Java dataset
  main: train
  flags-dest: namespace:flags
  flags:
    n_batches: 200000
    n_problems: 250
    batch_size: 8
    learning_rate: 1e-3
    n_embd: 192
    max_tokens: 4000
    max_length: 4000
    eval_every: 1000

  # matches the guild_output_scalars() helper function
  output-scalars:
    - step: '\| step: (\step)'
    - '\| (\key): (\value)'
201 changes: 201 additions & 0 deletions experiments/codenet/train.py
@@ -0,0 +1,201 @@
from pathlib import Path
from types import SimpleNamespace

from pymongo import MongoClient
from sklearn.pipeline import Pipeline

from origami.inference import Predictor
from origami.model import ORIGAMI
from origami.model.vpda import ObjectVPDA
from origami.preprocessing import (
    DFDataset,
    DocPermuterPipe,
    DocTokenizerPipe,
    PadTruncTokensPipe,
    TargetFieldPipe,
    TokenEncoderPipe,
    UpscalerPipe,
    load_df_from_mongodb,
)
from origami.utils.common import set_seed
from origami.utils.config import GuardrailsMethod, ModelConfig, PositionEncodingMethod, TrainConfig
from origami.utils.guild import load_secrets, print_guild_scalars

# populated by guild
flags = SimpleNamespace()
secrets = load_secrets()

# for reproducibility
set_seed(1234)

TARGET_FIELD = "problem"
UPSCALE = 2

client = MongoClient(secrets["MONGO_URI"])
collection = client["codenet_java"].train

target_problems = collection.distinct(TARGET_FIELD)
num_problems = len(target_problems)

target_problems = target_problems[: flags.n_problems]
print(f"training on {flags.n_problems} problems (out of {num_problems})")

# load data into dataframe for train/test

# use the URI from .env.local instead of a hard-coded localhost address
train_docs_df = load_df_from_mongodb(
    secrets["MONGO_URI"],
    "codenet_java",
    "train",
    filter={"problem": {"$in": target_problems}},
    projection={"_id": 0, "filePath": 0},
)

test_docs_df = load_df_from_mongodb(
    secrets["MONGO_URI"],
    "codenet_java",
    "test",
    filter={"problem": {"$in": target_problems}},
    projection={"_id": 0, "filePath": 0},
)

num_train_inst = len(train_docs_df)
num_test_inst = len(test_docs_df)

# create train and test pipelines
pipes = {
    # --- train only ---
    "upscaler": UpscalerPipe(n=UPSCALE),
    "permuter": DocPermuterPipe(shuffle_arrays=True),
    # --- test only ---
    "target": TargetFieldPipe(TARGET_FIELD),
    # --- train and test ---
    "tokenizer": DocTokenizerPipe(path_in_field_tokens=False),
    "padding": PadTruncTokensPipe(length=flags.max_length),
    "encoder": TokenEncoderPipe(max_tokens=flags.max_tokens),
}

train_pipeline = Pipeline(
    [(name, pipes[name]) for name in ("target", "upscaler", "permuter", "tokenizer", "padding", "encoder")],
    verbose=True,
)
test_pipeline = Pipeline([(name, pipes[name]) for name in ("target", "tokenizer", "padding", "encoder")], verbose=True)

# process train, eval and test data (first fit both, then transform)
train_pipeline.fit(train_docs_df)
test_pipeline.fit(test_docs_df)

train_df = train_pipeline.transform(train_docs_df)
test_df = test_pipeline.transform(test_docs_df)

# drop the docs column to save space
train_df.drop(columns=["docs"], inplace=True)
test_df.drop(columns=["docs"], inplace=True)

# drop all rows where the tokens array doesn't end in 0 (longer than max_length)
train_df = train_df[train_df["tokens"].apply(lambda x: x[-1] == 0)]
test_df = test_df[test_df["tokens"].apply(lambda x: x[-1] == 0)]

# get stateful objects
encoder = pipes["encoder"].encoder
block_size = pipes["padding"].length

# print data stats
print(
    f"dropped {(1 - (len(train_df) / (UPSCALE * num_train_inst))) * 100:.2f}% training instances, and "
    f"{(1 - (len(test_df) / num_test_inst)) * 100:.2f}% test instances."
)
print(f"vocab size {encoder.vocab_size}")
print(f"block size {block_size}")

# confirm that all targets are in the vocabulary
for target in train_df["target"].unique():
    enc = encoder.encode(target)
    assert target == encoder.decode(enc), f"token {target} not represented in vocab."

for target in test_df["target"].unique():
    enc = encoder.encode(target)
    assert target == encoder.decode(enc), f"token {target} not represented in vocab."

# create datasets, VPDA and model

# model and train configs
model_config = ModelConfig.from_preset("small")
model_config.position_encoding = PositionEncodingMethod.KEY_VALUE
model_config.vocab_size = encoder.vocab_size
model_config.block_size = block_size
model_config.n_embd = flags.n_embd
model_config.mask_field_token_losses = False
model_config.tie_weights = False
model_config.guardrails = GuardrailsMethod.STRUCTURE_ONLY
model_config.fuse_pos_with_mlp = True

train_config = TrainConfig()
train_config.learning_rate = flags.learning_rate
train_config.batch_size = flags.batch_size
train_config.n_warmup_batches = 100
train_config.eval_every = flags.eval_every

# datasets
train_dataset = DFDataset(train_df)
test_dataset = DFDataset(test_df)

vpda = ObjectVPDA(encoder)
model = ORIGAMI(model_config, train_config, vpda=vpda)

# load model checkpoint if it exists
checkpoint_file = Path("./gpt-codenet-snapshot.pt")
if checkpoint_file.is_file():
    model.load("gpt-codenet-snapshot.pt")
    print(f"loading existing checkpoint at batch_num {model.batch_num}...")


# create a predictor
predictor = Predictor(model, encoder, TARGET_FIELD)


def progress_callback(model):
    print_guild_scalars(
        step=f"{int(model.batch_num)}",
        epoch=model.epoch_num,
        batch_num=model.batch_num,
        batch_dt=f"{model.batch_dt * 1000:.2f}",
        batch_loss=f"{model.loss:.4f}",
        lr=f"{model.learning_rate:.2e}",
    )
    if model.batch_num % train_config.eval_every == 0:
        try:
            # train_acc = predictor.accuracy(train_dataset.sample(n=100))
            test_acc = predictor.accuracy(test_dataset.sample(n=100), show_progress=True)
            print_guild_scalars(
                step=f"{int(model.batch_num)}",
                # train_acc=f"{train_acc:.4f}",
                test_acc=f"{test_acc:.4f}",
            )
            # print(f"Train accuracy @ 100: {train_acc:.4f}, Test accuracy @ 100: {test_acc:.4f}")
        except AssertionError as e:
            print(e)
            print("continuing...")

        model.save("gpt-codenet-snapshot.pt")
        print("model saved to gpt-codenet-snapshot.pt")


model.set_callback("on_batch_end", progress_callback)

try:
    model.train_model(train_dataset, batches=flags.n_batches)
except KeyboardInterrupt:
    pass

# final save
model.save("gpt-codenet-snapshot.pt")
print("model saved to gpt-codenet-snapshot.pt")

test_acc = predictor.accuracy(test_dataset, show_progress=True)
print_guild_scalars(
    step=f"{int(model.batch_num / train_config.eval_every)}",
    test_acc=f"{test_acc:.4f}",
)

# count every dropped test instance as a misclassification
dropped_ratio = 1 - (len(test_df) / num_test_inst)
print(f"Final test accuracy when taking into account the dropped instances: {(1 - dropped_ratio) * test_acc:.4f}")
2 changes: 2 additions & 0 deletions experiments/ddxplus/.env.local
@@ -0,0 +1,2 @@
MONGO_URI="mongodb://localhost:27017"
DATABASE=ddxplus
65 changes: 65 additions & 0 deletions experiments/ddxplus/README.md
@@ -0,0 +1,65 @@
# DDXPlus Experiments

In this experiment we train the model on the [DDXPlus dataset](https://arxiv.org/abs/2205.09148), a dataset for automated medical diagnosis. We devise a task to predict the most likely differential diagnoses for each instance, a multi-label prediction task.

For ORiGAMi, we reformat the dataset into JSON format with two different representations:

- A flat representation, in which we store the evidences and their values as strings.
- An object representation, where the evidences are stored as objects containing array values.
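
The difference between the two representations can be sketched as follows. The evidence codes, the `_@_` separator between an evidence and its value, the choice of an empty list for evidences without a value, and the `flat_to_object` helper are all assumptions made for this illustration. The documents stored in the MongoDB dump may be laid out differently.

```python
from collections import defaultdict


def flat_to_object(evidences, sep="_@_"):
    """Group flat 'evidence<sep>value' strings into {evidence: [values]}.

    Hypothetical helper; evidences without a value map to an empty list.
    """
    grouped = defaultdict(list)
    for item in evidences:
        name, found, value = item.partition(sep)
        if found:
            grouped[name].append(value)
        else:
            grouped.setdefault(name, [])
    return dict(grouped)


# Made-up evidence strings in the flat representation.
flat = ["E_53", "E_54_@_V_161", "E_54_@_V_183"]
as_object = flat_to_object(flat)
print(as_object)
```

In the object form, all values of one evidence sit together in a single array, whereas the flat form repeats the evidence name once per value.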

We compare our model against the following baselines: Logistic Regression, Random Forests, XGBoost, and LightGBM. The baselines are trained on the
flat representation by converting the evidence-value strings into a multi-label binary matrix. We wrap each model in a scikit-learn
[MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html).
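
The conversion into a multi-label binary matrix can be sketched in plain Python as shown below. The `binarize` helper and the evidence strings are hypothetical and only illustrate the idea. The actual baseline code may rely on a library routine such as scikit-learn's `MultiLabelBinarizer` instead.

```python
def binarize(rows):
    """Turn lists of evidence strings into a 0/1 matrix with one column per string."""
    vocab = sorted({item for row in rows for item in row})
    index = {item: i for i, item in enumerate(vocab)}
    matrix = []
    for row in rows:
        vec = [0] * len(vocab)
        for item in row:
            vec[index[item]] = 1
        matrix.append(vec)
    return vocab, matrix


# Two made-up patients described by flat evidence-value strings.
rows = [
    ["E_53", "E_54_@_V_161"],
    ["E_54_@_V_161", "E_54_@_V_183"],
]
columns, X = binarize(rows)
print(columns)
print(X)
```

Each row of `X` is then a fixed-length feature vector that a standard classifier can consume. The same encoding applied to the pathology labels gives the multi-label targets.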

First, make sure you have restored the datasets from the mongo dump file as described in [../README.md](../README.md). All commands (see below) must be run from the `ddxplus` directory.

## ORiGAMi

We train a model with the `medium` size preset by default: 6 layers, 6 heads, and an embedding dimensionality of 192. To train with other model sizes, append `model_size=<size>` to the command, using one of the following options: `xs`, `small`, `medium`, `large`, `xl`.

To train and evaluate ORiGAMi on the flat evidences structure, run the following:

```bash
guild run origami:train evidences=flat eval_data=test seed="[1, 2, 3, 4, 5]"
```

For the object representation of evidences, run instead:

```bash
guild run origami:train evidences=object eval_data=test seed="[1, 2, 3, 4, 5]"
```

This will repeat the training and evaluation 5 times with different random seeds and evaluate on the test set.

## Baselines

### Hyperparameter optimization

First, run hyperparameter optimization (HPO). Choose `<model>` from `lr` (Logistic Regression), `rf` (Random Forest), `xgb` (XGBoost), and `lgb` (LightGBM), set the number of trial runs with `--max-trials <num>`, and name the run with `--label <label>`, e.g.

```bash
NUMPY_EXPERIMENTAL_DTYPE_API=1 guild run lr:hyperopt --optimizer random --max-trials 20 --label <label>
```

To find the best parameters on the validation dataset, use:

```bash
guild compare -Fl <label> -u
```

Sort the `f1_val_mean` column in descending order (press `S` key) and pick the run ID (first column) of the best configuration.

Get the hyperparameters (= flags) with `guild runs info <run-id>`.

### Evaluate best hyperparameters on test dataset

Once you have found the best hyperparameters, train and evaluate the model with them, e.g.:

```bash
guild run lr:train <param1>=<value1> <param2>=<value2> ...
```

Replace the `<param>` and `<value>` placeholders with the optimal hyperparameters. You can ignore `model_name` and `n_random_seeds` here.
By default, the evaluation is done 5 times with different random seeds.

The `<metric>_test_mean` and `<metric_test_val>` scores show the evaluation on the test dataset, where `<metric>` is one of `f1`, `precision`, `recall`.
File renamed without changes.