## Training an ORiGAMi model on the Dungeons dataset

### The Dungeons Dataset

The Dungeons dataset is a challenging, dungeon-themed synthetic dataset for supervised classification on
semi-structured data.

Each instance contains a `corridor` array with several rooms. Each room has a door number and contains multiple
treasure chests with different-colored keys. All but one of the treasures are fake.

The goal is to find the correct room number and key color in each dungeon based on two clues and to return the
only real treasure. The clues are given at the top level of the object in the fields `door` and `key_color`.

To make the task harder, the `corridor` array may be shuffled (`shuffle_rooms=True`), the key fields within a
room may appear in random order (`shuffle_keys=True`), and room objects may have a `monsters` array as their
first field (`with_monsters=True`), which shifts the token positions of the serialized object by a variable amount.

The following dictionary represents one example JSON instance:

```json
{
  "door": 1, // clue which door is the correct one
  "key_color": "blue", // clue which key is the correct one
  "corridor": [
    // a corridor with many doors
    {
      "monsters": ["troll", "wolf"], // optional monsters in front of the door
      "door_no": 1, // door number in the corridor
      "red_key": "gemstones", // different keys return different treasures,
      "blue_key": "spellbooks", // but only one is real, the others are fake
      "green_key": "artifacts"
    },
    {
      // another room, here without monsters
      "door_no": 0, // rooms can be shuffled, here room 0 comes after 1
      "red_key": "diamonds",
      "blue_key": "gold",
      "green_key": "gemstones"
    }
    // ... more rooms ...
  ],
  "treasure": "spellbooks" // correct treasure (target label)
}
```

The correct answer for this instance is "spellbooks", because `door` is 1, `key_color` is "blue", and the
`blue_key` of the room with `door_no` 1 holds "spellbooks".
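
The labeling rule can be written down directly. The following is a minimal sketch, assuming the field layout of the example above; `find_treasure` is a hypothetical helper for illustration and is not part of the `origami` package.

```python
def find_treasure(dungeon: dict) -> str:
    """Return the real treasure by following the `door` and `key_color` clues."""
    # rooms may be shuffled, so search by `door_no` instead of indexing by position
    room = next(r for r in dungeon["corridor"] if r["door_no"] == dungeon["door"])
    # the clue color selects which key field holds the real treasure
    return room[f"{dungeon['key_color']}_key"]


# the example instance from above, without the explanatory comments
example = {
    "door": 1,
    "key_color": "blue",
    "corridor": [
        {
            "monsters": ["troll", "wolf"],
            "door_no": 1,
            "red_key": "gemstones",
            "blue_key": "spellbooks",
            "green_key": "artifacts",
        },
        {"door_no": 0, "red_key": "diamonds", "blue_key": "gold", "green_key": "gemstones"},
    ],
    "treasure": "spellbooks",
}

# the rule reproduces the target label
assert find_treasure(example) == example["treasure"]
```

The model is never given this rule. It only sees the serialized object and the target label, so it has to learn the lookup from the data.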


### Preprocessing

The JSON objects are tokenized by walking through them recursively, depth-first, and extracting key and value tokens.
Whenever an array or nested object is encountered, special grammar tokens are added to the sequence.
The following diagram illustrates the tokenization:

<img src="../assets/preprocessing-diagram.png" width="600px" />
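
To make the traversal concrete, here is a minimal sketch of a depth-first tokenizer. The grammar token names (`<OBJ_START>`, `<OBJ_END>`, `<ARR_START>`, `<ARR_END>`) and the `key:`/`val:` prefixes are assumptions chosen for illustration; the actual tokens and encoding produced by the `origami` pipelines may differ.

```python
def tokenize(value, tokens=None):
    """Depth-first walk over a JSON-like value, emitting key, value and grammar tokens.

    Token names are hypothetical and only illustrate the idea.
    """
    if tokens is None:
        tokens = []
    if isinstance(value, dict):
        tokens.append("<OBJ_START>")
        for key, child in value.items():
            tokens.append(f"key:{key}")
            tokenize(child, tokens)
        tokens.append("<OBJ_END>")
    elif isinstance(value, list):
        tokens.append("<ARR_START>")
        for item in value:
            tokenize(item, tokens)
        tokens.append("<ARR_END>")
    else:
        # primitive value (number, string, ...)
        tokens.append(f"val:{value!r}")
    return tokens


room_plain = {"door_no": 0, "red_key": "gold"}
room_monsters = {"monsters": ["troll", "wolf"], "door_no": 0, "red_key": "gold"}

plain_tokens = tokenize(room_plain)
monster_tokens = tokenize(room_monsters)

print(plain_tokens)
# position of the `door_no` key with and without a leading `monsters` array
print(plain_tokens.index("key:door_no"), monster_tokens.index("key:door_no"))
```

In this sketch, two monsters move the `door_no` key from position 1 to position 6, which shows why `with_monsters=True` shifts token positions by a variable amount.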


In [1]:
import json

from sklearn.model_selection import train_test_split

from origami.datasets.dungeons import generate_data
from origami.preprocessing import build_prediction_pipelines, docs_to_df
from origami.utils import set_seed
from origami.utils.config import PipelineConfig

# for reproducibility
set_seed(123)

# generate Dungeons dataset (see origami/datasets/dungeons.py)
data = generate_data(
    num_instances=10_000,
    num_doors_range=(4, 8),
    num_colors=3,
    num_treasures=5,
    with_monsters=True,  # makes it harder as token positions get shifted by variable amount
    shuffle_rooms=True,  # makes it harder because rooms are in random order
    shuffle_keys=True,  # makes it harder because keys are in random order
)

# print example dictionary
print(json.dumps(data[0], indent=2))

# load data into dataframe and split into train/test
df = docs_to_df(data)
train_docs_df, test_docs_df = train_test_split(df, test_size=0.2, shuffle=True)

TARGET_FIELD = "treasure"

# create train and test pipelines
pipelines = build_prediction_pipelines(
    pipeline_config=PipelineConfig(sequence_order="ORDERED", upscale=1), target_field=TARGET_FIELD
)

# process train and test data
train_df = pipelines["train"].fit_transform(train_docs_df)
test_df = pipelines["test"].transform(test_docs_df)

# get stateful objects
schema = pipelines["train"]["schema"].schema
encoder = pipelines["train"]["encoder"].encoder
block_size = pipelines["train"]["padding"].length

# print data stats
print(f"len train: {len(train_df)}, len test: {len(test_df)}")
print(f"vocab size {encoder.vocab_size}")
print(f"block size {block_size}")


{
  "door": 0,
  "key_color": "blue",
  "corridor": [
    {
      "door_no": 1,
      "red_key": "spellbooks",
      "green_key": "gemstones",
      "blue_key": "gemstones"
    },
    {
      "monsters": [
        "orc",
        "wolf"
      ],
      "door_no": 2,
      "red_key": "gold",
      "blue_key": "gold",
      "green_key": "artifacts"
    },
    {
      "door_no": 0,
      "green_key": "spellbooks",
      "red_key": "artifacts",
      "blue_key": "diamonds"
    },
    {
      "door_no": 3,
      "red_key": "diamonds",
      "blue_key": "spellbooks",
      "green_key": "diamonds"
    }
  ],
  "treasure": "diamonds"
}
len train: 8000, len test: 2000
vocab size 49
block size 123


In [12]:
# optionally save dungeon dataset to MongoDB
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client.dungeons.dungeon_10k_4_8_3_5_mkr
# note: insert_many() adds an `_id` field to each dict in `data` in place
collection.insert_many(data)

InsertManyResult([ObjectId('671aff233a5991b253bc21cf'), ObjectId('671aff233a5991b253bc21d0'), ObjectId('671aff233a5991b253bc21d1'), ...

### ORiGAMi Model

Here we instantiate an ORiGAMi model, a modified transformer trained on the token sequences created above.
We use a standard "medium" configuration. ORiGAMi models are relatively robust to the choice of hyperparameters,
and the default configurations often work well for mid-sized datasets.


In [2]:
from origami.model import ORIGAMI
from origami.model.vpda import ObjectVPDA
from origami.preprocessing import DFDataset
from origami.utils import ModelConfig, TrainConfig, count_parameters

# model and train configs
model_config = ModelConfig.from_preset("medium")  # see origami/utils/config.py for different presets
model_config.position_encoding = "KEY_VALUE"
model_config.vocab_size = encoder.vocab_size
model_config.block_size = block_size

train_config = TrainConfig()
train_config.learning_rate = 1e-3
train_config.print_every = 10
train_config.eval_every = 500

# wrap dataframes in datasets
train_dataset = DFDataset(train_df)
test_dataset = DFDataset(test_df)

# create PDA and pass it to the model
vpda = ObjectVPDA(encoder, schema)
model = ORIGAMI(model_config, train_config, vpda=vpda)

n_params = count_parameters(model)
print(f"Number of parameters: {n_params / 1e6:.2f}M")

Number of parameters: 2.69M


In [3]:
from origami.inference import Predictor
from origami.utils import make_progress_callback

# create a predictor
predictor = Predictor(model, encoder, TARGET_FIELD)

# create and register progress callback
progress_callback = make_progress_callback(
    train_config, train_dataset=train_dataset, test_dataset=test_dataset, predictor=predictor
)
model.set_callback("on_batch_end", progress_callback)

# train model (train and test accuracy should start to go towards 1.0 after ~3000 batches as loss drops below 0.7)
model.train_model(train_dataset, batches=5000)

|  step: 0  |  epoch: 0  |  batch_num: 0  |  batch_dt: 0.00  |  batch_loss: 2.6713  |  lr: 1.01e-06  |  train_acc: 0.0000  |  test_loss: 2.6694  |  test_acc: 0.0000  |
|  step: 1  |  epoch: 0  |  batch_num: 10  |  batch_dt: 301.62  |  batch_loss: 2.5666  |  lr: 1.10e-05  |
|  step: 2  |  epoch: 0  |  batch_num: 20  |  batch_dt: 303.20  |  batch_loss: 2.3215  |  lr: 2.10e-05  |
|  step: 3  |  epoch: 0  |  batch_num: 30  |  batch_dt: 305.61  |  batch_loss: 2.1036  |  lr: 3.10e-05  |
|  step: 4  |  epoch: 0  |  batch_num: 40  |  batch_dt: 300.66  |  batch_loss: 1.9355  |  lr: 4.10e-05  |
|  step: 5  |  epoch: 0  |  batch_num: 50  |  batch_dt: 308.98  |  batch_loss: 1.7822  |  lr: 5.10e-05  |
|  step: 6  |  epoch: 0  |  batch_num: 60  |  batch_dt: 304.86  |  batch_loss: 1.6068  |  lr: 6.10e-05  |
|  step: 7  |  epoch: 0  |  batch_num: 70  |  batch_dt: 302.95  |  batch_loss: 1.4233  |  lr: 7.10e-05  |
|  step: 8  |  epoch: 1  |  batch_num: 80  |  batch_dt: 310.33  |  batch_loss: 1.3163  |  

In [4]:
# calculate test accuracy
acc = predictor.accuracy(test_dataset, show_progress=True)
print(f"Test accuracy: {acc:.4f}")

# we can also access the predictions with the `predict()` method
predictions = predictor.predict(test_dataset)
print("Model predictions: ", predictions[:10])
print("Correct labels: ", test_dataset.df["target"].to_list()[:10])

Predicting: 100%|██████████| 68/68 [00:20<00:00,  3.39it/s]


Test accuracy: 0.9975
Model predictions:  ['gemstones', 'artifacts', 'gemstones', 'spellbooks', 'gemstones', 'gold', 'gold', 'diamonds', 'artifacts', 'gold']
Correct labels:  ['gemstones', 'artifacts', 'gemstones', 'spellbooks', 'gemstones', 'gold', 'gold', 'diamonds', 'artifacts', 'gold']
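
As a sanity check, accuracy is the fraction of predictions that match their labels. The sketch below recomputes it in plain Python on the ten predictions printed above; `accuracy` is a hypothetical helper for illustration, not the implementation behind `predictor.accuracy()`.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    matches = sum(p == t for p, t in zip(predictions, labels))
    return matches / len(labels)


# first ten predictions and labels, copied from the output above
predictions = ["gemstones", "artifacts", "gemstones", "spellbooks", "gemstones",
               "gold", "gold", "diamonds", "artifacts", "gold"]
labels = ["gemstones", "artifacts", "gemstones", "spellbooks", "gemstones",
          "gold", "gold", "diamonds", "artifacts", "gold"]

print(f"Accuracy on first ten: {accuracy(predictions, labels):.4f}")

# convert the reported test accuracy back into a number of misclassified dungeons
n_test = 2000
n_errors = round(n_test * (1 - 0.9975))
print(f"Misclassified test instances: {n_errors}")
```

On the 2000 test instances, the reported accuracy of 0.9975 corresponds to 5 misclassified dungeons.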
