## Training a RandomForestClassifier on the Dungeons dataset

### The Dungeons Dataset

The Dungeons dataset is a challenging, dungeon-themed synthetic dataset for supervised classification on
semi-structured data.

Each instance contains a `corridor` array with several rooms. Each room has a door number and contains multiple
treasure chests, each opened by a different-colored key. All but one of the treasures in a dungeon are fake.

The goal is to find the correct room number and key color in each dungeon based on two clues and return the
only real treasure. The clues are given at the top level of the object in the fields `door` and `key_color`.

To make it even harder, the `corridor` array may be shuffled (`shuffle_rooms=True`), and room objects may
have a number of monsters as their first field (`with_monsters=True`), shifting the token positions of the
serialized object by a variable amount.

The following dictionary represents one example JSON instance:

```json
{
  "door": 1, // clue which door is the correct one
  "key_color": "blue", // clue which key is the correct one
  "corridor": [
    {
      "monsters": ["troll", "wolf"], // optional monsters in front of the door
      "door_no": 1, // door number in the corridor
      "red_key": "gemstones", // different keys return different treasures,
      "blue_key": "spellbooks", // but only one is real, the others are fake
      "green_key": "artifacts"
    },
    {
      // another room
      "door_no": 0, // rooms can be shuffled, here room 0 comes after 1
      "red_key": "diamonds",
      "blue_key": "gold",
      "green_key": "gemstones"
    }
    // ... more doors ...
  ],
  "treasure": "spellbooks" // correct treasure (target label)
}
```

The correct answer for this instance is "spellbooks": the `door` clue is 1 and the `key_color` clue is "blue", and
the blue key in the room with `door_no` 1 yields "spellbooks".
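
The lookup described above can be sketched in a few lines of plain Python. This is only an illustration of the labeling rule, assuming the instance layout shown in the JSON example; `find_treasure` is a hypothetical helper and not part of the origami package.

```python
def find_treasure(dungeon: dict) -> str:
    """Return the real treasure by following the two top-level clues."""
    # Scan the corridor instead of indexing by position, because rooms may be shuffled.
    for room in dungeon["corridor"]:
        if room["door_no"] == dungeon["door"]:
            # The clue "blue" selects the "blue_key" field of the matching room.
            return room[f"{dungeon['key_color']}_key"]
    raise ValueError("no room matches the door clue")


# The example instance from above, reduced to its two rooms.
example = {
    "door": 1,
    "key_color": "blue",
    "corridor": [
        {
            "monsters": ["troll", "wolf"],
            "door_no": 1,
            "red_key": "gemstones",
            "blue_key": "spellbooks",
            "green_key": "artifacts",
        },
        {"door_no": 0, "red_key": "diamonds", "blue_key": "gold", "green_key": "gemstones"},
    ],
    "treasure": "spellbooks",
}

print(find_treasure(example))  # spellbooks
```

The rule is a two-step lookup: first match `door` against every `door_no`, then use `key_color` to pick the field within that room. A classifier has to recover this dependency from the data alone.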


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

from origami.datasets.dungeons import generate_data
from origami.utils import flatten_docs

# generate Dungeons dataset (see origami/datasets/dungeons.py)
data = generate_data(
    num_instances=10_000,
    num_doors_range=(5, 10),
    num_colors=3,
    num_treasures=5,
    with_monsters=True,  # makes it harder as token positions get shifted by variable amount
    shuffle_rooms=True,  # makes it harder because rooms are in random order
)

# flatten docs, load into dataframe and split into train/test
df = pd.DataFrame(flatten_docs(data))
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True)

TARGET_FIELD = "treasure"

train_df.head()

Unnamed: 0,door,key_color,corridor.[0].monsters.[0],corridor.[0].door_no,corridor.[0].red_key,corridor.[0].blue_key,corridor.[0].green_key,corridor.[1].monsters.[0],corridor.[1].door_no,corridor.[1].red_key,...,corridor.[9].monsters.[0],corridor.[9].monsters.[1],corridor.[9].door_no,corridor.[9].red_key,corridor.[9].blue_key,corridor.[9].green_key,corridor.[6].monsters.[1],corridor.[7].monsters.[0],corridor.[8].monsters.[1],corridor.[7].monsters.[1]
4074,8,green,troll,1,artifacts,gold,gold,,4,gold,...,,,,,,,,dragon,,orc
5569,9,blue,troll,9,gold,spellbooks,diamonds,troll,0,spellbooks,...,dragon,,8.0,gemstones,artifacts,artifacts,,goblin,,troll
5971,8,green,goblin,7,gemstones,spellbooks,artifacts,,2,gold,...,,,,,,,,goblin,,wolf
4665,1,red,,7,diamonds,gemstones,spellbooks,wolf,6,gold,...,,,,,,,,orc,,troll
6380,1,green,wolf,4,gold,gemstones,diamonds,,2,spellbooks,...,,,,,,,,,,


### Random Forest Classifier

We will attempt to learn the same Dungeons dataset as used in `example_origami_dungeons.ipynb` with a
[RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
from scikit-learn.

We recursively flatten the dataset, creating a column for each field path (e.g. `corridor.[2].blue_key`). Then we
one-hot encode all features, including the numeric fields (`door` and `door_no`), as these are of low cardinality
(here at most 10 distinct values) and better treated as categorical data.
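
To make the flattening step concrete, here is a minimal sketch of recursive flattening followed by a naive one-hot encoding. It assumes the path format visible in the dataframe output (`corridor.[0].door_no`); `flatten` is a hypothetical stand-in and not the implementation of origami's `flatten_docs`, and the dictionary-based one-hot step only imitates what `OneHotEncoder` does on the real dataframe.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into a {path: leaf_value} mapping."""
    items = {}
    if isinstance(value, dict):
        for key, child in value.items():
            items.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            items.update(flatten(child, f"{prefix}.[{index}]" if prefix else f"[{index}]"))
    else:
        # Leaf value: the accumulated path becomes the column name.
        items[prefix] = value
    return items


doc = {
    "door": 1,
    "key_color": "blue",
    "corridor": [
        {"monsters": ["troll"], "door_no": 1, "blue_key": "spellbooks"},
        {"door_no": 0, "blue_key": "gold"},
    ],
}

row = flatten(doc)
for path, value in row.items():
    print(path, "=", value)

# Naive one-hot view: every (column, value) pair becomes its own binary feature,
# so the numeric `door_no` is treated as a category rather than a magnitude.
one_hot = {f"{path}={value}": 1 for path, value in row.items()}
```

Rooms without monsters simply produce no `monsters` path, which is why the flattened dataframe contains missing values in those columns.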

Next we run a randomized hyper-parameter search over 100 configurations with 5-fold cross-validation on the training
portion of the data. The best model is fitted on the full training data, and we report classification accuracy on the
test data.
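
As a rough sense of scale, the arithmetic behind the search can be worked out directly. The per-parameter counts below are read off the `param_dist` grid defined in the search cell of this notebook; the names `grid_sizes`, `n_iter` and `cv_folds` are illustrative only.

```python
from math import prod

# Number of candidate values per hyper-parameter in the search grid.
grid_sizes = {
    "n_estimators": 10,      # 200, 400, ..., 2000
    "max_features": 2,       # "log2", "sqrt"
    "max_depth": 12,         # 10, 20, ..., 110, plus None
    "min_samples_split": 3,  # 2, 5, 10
    "min_samples_leaf": 3,   # 1, 2, 4
    "bootstrap": 2,          # True, False
}
n_iter, cv_folds = 100, 5

total_configs = prod(grid_sizes.values())
total_fits = n_iter * cv_folds

print(total_configs)                    # 4320
print(total_fits)                       # 500
print(f"{n_iter / total_configs:.1%}")  # 2.3%
```

The randomized search therefore samples about 2.3% of the full grid and trains 500 forests during cross-validation, which matches the "totalling 500 fits" line in the cell output.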

Despite extensive parameter search, the best model achieves a test accuracy of ~0.32, which is only marginally better
than random guessing (0.2) as we have 5 treasure types to choose from.


In [2]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# extract target
y_train = train_df[TARGET_FIELD]
y_test = test_df[TARGET_FIELD]

# remove target from features
X_train = train_df.drop(TARGET_FIELD, axis=1)
X_test = test_df.drop(TARGET_FIELD, axis=1)

# preprocess categorical features
cat_features = X_train.columns

# replace all categorical nan values with "n/a" string
X_train[cat_features] = X_train[cat_features].fillna("n/a")
X_test[cat_features] = X_test[cat_features].fillna("n/a")

# convert categorical features to strings and one-hot encode
X_train[cat_features] = X_train[cat_features].astype("string")
X_test[cat_features] = X_test[cat_features].astype("string")
cat_steps = [("encoder", OneHotEncoder(handle_unknown="ignore"))]

preprocessor = Pipeline(steps=cat_steps)

# fit and transform categorical features
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

display(f"{X_train.shape=}")

# label-encode targets
label_encoder = LabelEncoder()

label_encoder.fit(pd.concat((y_train, y_test), axis=0))
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)

# define the parameter space for hyperparameter tuning
param_dist = {
    "n_estimators": [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    "max_features": ["log2", "sqrt"],
    "max_depth": [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# create a base model
rf = RandomForestClassifier()

# instantiate the randomized search
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1
)

# fit the random search model
random_search.fit(X_train, y_train)

# get the best model; RandomizedSearchCV (refit=True by default) has already
# refit it on the full training data, so no additional fit call is needed
best_model = random_search.best_estimator_

# evaluate the best model
y_pred_train = best_model.predict(X_train)
train_acc = accuracy_score(y_train, y_pred_train)

y_pred_test = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred_test)

print("-" * 80)
print(f"Best parameters: {random_search.best_params_}")
print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

'X_train.shape=(8000, 403)'

Fitting 5 folds for each of 100 candidates, totalling 500 fits
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=  11.3s
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=  11.4s
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=  11.4s
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=  11.4s
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=  11.4s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=2000; total time=  16.5s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=2000; to



[CV] END bootstrap=False, max_depth=100, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=1000; total time=  39.1s
[CV] END bootstrap=False, max_depth=100, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=1000; total time=  39.6s
[CV] END bootstrap=False, max_depth=100, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=1000; total time=  39.6s
[CV] END bootstrap=False, max_depth=60, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=600; total time=  26.0s
[CV] END bootstrap=False, max_depth=60, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=600; total time=  26.0s
[CV] END bootstrap=False, max_depth=60, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=600; total time=  26.3s
[CV] END bootstrap=False, max_depth=60, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=600; total time=  26.6s
[CV] END bootstrap=False, max_depth