diff --git a/experiments/README.md b/experiments/README.md new file mode 100644 index 0000000..4646b19 --- /dev/null +++ b/experiments/README.md @@ -0,0 +1,27 @@ +# Reproducing the results from our paper + +This directory contains the code and instructions to reproduce the experiments from our paper: +[ORIGAMI: A generative transformer architecture for predictions from semi-structured data](https://arxiv.org/abs/2412.17348). + +There are 3 sub-directories, each with its own `README.md` file: + +- [`json2vec`](./json2vec/README.md) contains the experiments from section 3.1, where we compare our model against various baselines and the json2vec models from [A Framework for End-to-End Learning on Semantic Tree-Structured Data](https://arxiv.org/abs/2002.05707) by William Woof and Ke Chen, on standard tabular benchmarks that have been converted to JSON. +- [`ddxplus`](./ddxplus/README.md) contains the experiments from section 3.2 for a medical diagnosis task on patient information. This experiment demonstrates prediction of multi-token values representing arrays of possible pathologies. +- [`codenet`](./codenet/README.md) contains the experiments from section 3.3 related to a Java code classification task. Here we demonstrate the model's ability to deal with complex and deeply nested JSON objects. + +### Experiment Tracking + +We use the open-source library [guild.ai](https://guild.ai) for experiment management and result tracking. + +### Datasets + +We bundled all datasets used in the paper in a [MongoDB dump file](). To reproduce the results, you first +need MongoDB installed on your system (or on a remote server). Then download the dump file, unzip it, and restore it into your MongoDB instance: + +``` +mongorestore dump/ +``` + +This assumes your `mongod` server is running on `localhost` on the default port 27017 and without authentication. 
If your setup differs, consult the [documentation](https://www.mongodb.com/docs/database-tools/mongorestore/) for `mongorestore` on how to restore the data. + +If your database setup (URI, port, authentication) differs, also make sure to update the [`.env.local`](.env.local) file in each sub-directory accordingly. diff --git a/experiments/prediction/.env.local b/experiments/codenet/.env.local similarity index 100% rename from experiments/prediction/.env.local rename to experiments/codenet/.env.local diff --git a/experiments/codenet/README.md b/experiments/codenet/README.md new file mode 100644 index 0000000..305e8f2 --- /dev/null +++ b/experiments/codenet/README.md @@ -0,0 +1,20 @@ +# CodeNet Java Experiments + +In this experiment, we convert Java code snippets from the [CodeNet](https://developer.ibm.com/exchanges/data/all/project-codenet/) dataset into Abstract Syntax Trees and store them as JSON objects. +We then train an ORiGAMi model on these ASTs for a classification task, where the programming problem ID is the target label. More details on the dataset and classification task can be found +in the paper [CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks](https://arxiv.org/abs/2105.12655) by Ruchir Puri et al. + +First, make sure you have restored the datasets from the MongoDB dump file as described in [../README.md](../README.md). All commands (see below) must be run from the `codenet` directory. + +### Training and evaluating the model + +Due to resource constraints, we did not perform hyperparameter optimization. We use a model with 4 transformer layers, 4 heads, and an embedding dimensionality of 192. All parameters are +configured as defaults in the `guild.yml` file. + +To run the training and evaluation on the test set, use: + +```bash +guild run train +``` + +Note: Training with the default parameters requires an estimated 50 GB of GPU RAM. 
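To make the AST-as-JSON representation described in the CodeNet README above more concrete, here is a minimal sketch. It is not the converter used for the paper: as a stand-in for a Java parser it uses Python's built-in `ast` module on a small Python function, and the helper name `node_to_json` is hypothetical. It only illustrates how a syntax tree turns into a deeply nested JSON object of the kind the model is trained on.

```python
import ast
import json


def node_to_json(node):
    """Recursively convert an ast node into plain dicts, lists, and scalars.

    Hypothetical helper for illustration only; the paper's Java-to-JSON
    conversion is not shown in this patch and may look different.
    """
    if isinstance(node, ast.AST):
        # Every syntax node becomes an object with its node type and its fields.
        doc = {"type": type(node).__name__}
        for field, value in ast.iter_fields(node):
            doc[field] = node_to_json(value)
        return doc
    if isinstance(node, list):
        # Child lists (e.g. function bodies) become JSON arrays.
        return [node_to_json(item) for item in node]
    # Leaves such as identifiers or None are kept as they are.
    return node


source = "def add(a, b):\n    return a + b\n"
doc = node_to_json(ast.parse(source))

# Print the beginning of the nested JSON document.
print(json.dumps(doc, indent=2)[:300])
```

Even this two-line function produces several levels of nesting (`Module` → `FunctionDef` → `Return` → `BinOp`), which hints at why real code snippets yield long token sequences after tokenization.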
diff --git a/experiments/codenet/guild.yml b/experiments/codenet/guild.yml new file mode 100644 index 0000000..3ad5fe3 --- /dev/null +++ b/experiments/codenet/guild.yml @@ -0,0 +1,18 @@ +train: + description: Train a model on the codenet Java dataset + main: train + flags-dest: namespace:flags + flags: + n_batches: 200000 + n_problems: 250 + batch_size: 8 + learning_rate: 1e-3 + n_embd: 192 + max_tokens: 4000 + max_length: 4000 + eval_every: 1000 + + # matches the guild_output_scalars() helper function + output-scalars: + - step: '\| step: (\step)' + - '\| (\key): (\value)' diff --git a/experiments/codenet/train.py b/experiments/codenet/train.py new file mode 100644 index 0000000..a40fe4d --- /dev/null +++ b/experiments/codenet/train.py @@ -0,0 +1,201 @@ +from pathlib import Path +from types import SimpleNamespace + +from pymongo import MongoClient +from sklearn.pipeline import Pipeline + +from origami.inference import Predictor +from origami.model import ORIGAMI +from origami.model.vpda import ObjectVPDA +from origami.preprocessing import ( + DFDataset, + DocPermuterPipe, + DocTokenizerPipe, + PadTruncTokensPipe, + TargetFieldPipe, + TokenEncoderPipe, + UpscalerPipe, + load_df_from_mongodb, +) +from origami.utils.common import set_seed +from origami.utils.config import GuardrailsMethod, ModelConfig, PositionEncodingMethod, TrainConfig +from origami.utils.guild import load_secrets, print_guild_scalars + +# populated by guild +flags = SimpleNamespace() +secrets = load_secrets() + +# for reproducibility +set_seed(1234) + +TARGET_FIELD = "problem" +UPSCALE = 2 + +client = MongoClient(secrets["MONGO_URI"]) +collection = client["codenet_java"].train + +target_problems = collection.distinct(TARGET_FIELD) +num_problems = len(target_problems) + +target_problems = target_problems[: flags.n_problems] +print(f"training on {flags.n_problems} problems (out of {num_problems})") + +# load data into dataframe for train/test + +train_docs_df = load_df_from_mongodb( + 
secrets["MONGO_URI"], + "codenet_java", + "train", + filter={"problem": {"$in": target_problems}}, + projection={"_id": 0, "filePath": 0}, +) + +test_docs_df = load_df_from_mongodb( + secrets["MONGO_URI"], + "codenet_java", + "test", + filter={"problem": {"$in": target_problems}}, + projection={"_id": 0, "filePath": 0}, +) + +num_train_inst = len(train_docs_df) +num_test_inst = len(test_docs_df) + +# create train and test pipelines +pipes = { + # --- train only --- + "upscaler": UpscalerPipe(n=UPSCALE), + "permuter": DocPermuterPipe(shuffle_arrays=True), + # --- train and test (target field) --- + "target": TargetFieldPipe(TARGET_FIELD), + # --- train and test --- + "tokenizer": DocTokenizerPipe(path_in_field_tokens=False), + "padding": PadTruncTokensPipe(length=flags.max_length), + "encoder": TokenEncoderPipe(max_tokens=flags.max_tokens), +} + +train_pipeline = Pipeline( + [(name, pipes[name]) for name in ("target", "upscaler", "permuter", "tokenizer", "padding", "encoder")], + verbose=True, +) +test_pipeline = Pipeline([(name, pipes[name]) for name in ("target", "tokenizer", "padding", "encoder")], verbose=True) + +# process train and test data (first fit both, then transform) +train_pipeline.fit(train_docs_df) +test_pipeline.fit(test_docs_df) + +train_df = train_pipeline.transform(train_docs_df) +test_df = test_pipeline.transform(test_docs_df) + +# drop the docs column to save space +train_df.drop(columns=["docs"], inplace=True) +test_df.drop(columns=["docs"], inplace=True) + +# drop all rows where the tokens array doesn't end in 0 (longer than max_length) +train_df = train_df[train_df["tokens"].apply(lambda x: x[-1] == 0)] +test_df = test_df[test_df["tokens"].apply(lambda x: x[-1] == 0)] + +# get stateful objects +encoder = pipes["encoder"].encoder +block_size = pipes["padding"].length + +# print data stats +print( + f"dropped {(1 - (len(train_df) / (UPSCALE * num_train_inst))) * 100:.2f}% training instances, and " + f"{(1 - (len(test_df) / num_test_inst)) 
* 100:.2f}% test instances." +) +print(f"vocab size {encoder.vocab_size}") +print(f"block size {block_size}") + +# confirm that all targets are in the vocabulary +for target in train_df["target"].unique(): + enc = encoder.encode(target) + assert target == encoder.decode(enc), f"token {target} not represented in vocab." + +for target in test_df["target"].unique(): + enc = encoder.encode(target) + assert target == encoder.decode(enc), f"token {target} not represented in vocab." + +# create datasets, VPDA and model + +# model and train configs +model_config = ModelConfig.from_preset("small") +model_config.position_encoding = PositionEncodingMethod.KEY_VALUE +model_config.vocab_size = encoder.vocab_size +model_config.block_size = block_size +model_config.n_embd = flags.n_embd +model_config.mask_field_token_losses = False +model_config.tie_weights = False +model_config.guardrails = GuardrailsMethod.STRUCTURE_ONLY +model_config.fuse_pos_with_mlp = True + +train_config = TrainConfig() +train_config.learning_rate = flags.learning_rate +train_config.batch_size = flags.batch_size +train_config.n_warmup_batches = 100 +train_config.eval_every = flags.eval_every + +# datasets +train_dataset = DFDataset(train_df) +test_dataset = DFDataset(test_df) + +vpda = ObjectVPDA(encoder) +model = ORIGAMI(model_config, train_config, vpda=vpda) + +# load model checkpoint if it exists +checkpoint_file = Path("./gpt-codenet-snapshot.pt") +if checkpoint_file.is_file(): + model.load("gpt-codenet-snapshot.pt") + print(f"loaded existing checkpoint at batch_num {model.batch_num}...") + + +# create a predictor +predictor = Predictor(model, encoder, TARGET_FIELD) + + +def progress_callback(model): + print_guild_scalars( + step=f"{int(model.batch_num)}", + epoch=model.epoch_num, + batch_num=model.batch_num, + batch_dt=f"{model.batch_dt * 1000:.2f}", + batch_loss=f"{model.loss:.4f}", + lr=f"{model.learning_rate:.2e}", + ) + if model.batch_num % train_config.eval_every == 0: + try: + # train_acc = 
predictor.accuracy(train_dataset.sample(n=100)) + test_acc = predictor.accuracy(test_dataset.sample(n=100), show_progress=True) + print_guild_scalars( + step=f"{int(model.batch_num)}", + # train_acc=f"{train_acc:.4f}", + test_acc=f"{test_acc:.4f}", + ) + # print(f"Train accuracy @ 100: {train_acc:.4f}, Test accuracy @ 100: {test_acc:.4f}") + except AssertionError as e: + print(e) + print("continuing...") + + model.save("gpt-codenet-snapshot.pt") + print("model saved to gpt-codenet-snapshot.pt") + + +model.set_callback("on_batch_end", progress_callback) + +try: + model.train_model(train_dataset, batches=flags.n_batches) +except KeyboardInterrupt: + pass + +# final save +model.save("gpt-codenet-snapshot.pt") +print("model saved to gpt-codenet-snapshot.pt") + +test_acc = predictor.accuracy(test_dataset, show_progress=True) +print_guild_scalars( + step=f"{int(model.batch_num / train_config.eval_every)}", + test_acc=f"{test_acc:.4f}", +) + +dropped_ratio = 1 - (len(test_df) / num_test_inst) +print(f"Final test accuracy when taking into account the dropped instances: {(1 - dropped_ratio) * test_acc:.4f}") diff --git a/experiments/ddxplus/.env.local b/experiments/ddxplus/.env.local new file mode 100644 index 0000000..9d63cbd --- /dev/null +++ b/experiments/ddxplus/.env.local @@ -0,0 +1,2 @@ +MONGO_URI="mongodb://localhost:27017" +DATABASE=ddxplus \ No newline at end of file diff --git a/experiments/ddxplus/README.md b/experiments/ddxplus/README.md new file mode 100644 index 0000000..6f7c982 --- /dev/null +++ b/experiments/ddxplus/README.md @@ -0,0 +1,65 @@ +# DDXPlus Experiments + +In this experiment, we train the model on the [DDXPlus dataset](https://arxiv.org/abs/2205.09148), a dataset for automated medical diagnosis. We devise a task to predict the most likely differential diagnoses for each instance, a multi-label prediction task. 
+ +For ORiGAMi, we reformat the dataset into JSON format with two different representations: + +- A flat representation, in which we store the evidences and their values as strings. +- An object representation, where the evidences are stored as objects containing array values. + +We compare our model against the following baselines: Logistic Regression, Random Forests, XGBoost, LightGBM. The baselines are trained on a +flat representation by converting the evidence-value strings into a multi-label binary matrix. We wrap each model in a scikit-learn +[MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html). + +First, make sure you have restored the datasets from the MongoDB dump file as described in [../README.md](../README.md). All commands (see below) must be run from the `ddxplus` directory. + +## ORiGAMi + +We train a model with the `medium` size preset by default: 6 layers, 6 heads, and an embedding dimensionality of 192. To train with other model sizes, append `model_size=` to the command, using one of the following options: `xs`, `small`, `medium`, `large`, `xl`. + +To train and evaluate ORiGAMi on the flat evidences structure, run the following: + +```bash +guild run origami:train evidences=flat eval_data=test seed="[1, 2, 3, 4, 5]" +``` + +For the object representation of evidences, run instead: + +```bash +guild run origami:train evidences=object eval_data=test seed="[1, 2, 3, 4, 5]" +``` + +This will repeat the training and evaluation 5 times with different random seeds and evaluate on the test set. + +## Baselines + +### Hyperparameter optimization + +First perform HPO, supplying the `` as one of `lr` (Logistic Regression), `rf` (Random Forest), `xgb` (XGBoost), `lgb` (LightGBM) and the appropriate number of trial runs with `--max-trials `, and give the run a name with `