# Token Datasets

Each model is trained on a token sequence dataset stored in a MongoDB instance.
To access these datasets, download the archive files listed in the following table and import them into a MongoDB instance.

| File                                                                                                            | Size  | Description                      |
| --                                                                                                              | ---   | ---                              |
| [maze.gz                   ](https://dl.fbaipublicfiles.com/searchformer/tokenSeqDB/maze.gz                   ) | 25GB  | Maze token training data         |
| [maze.vocabulary.gz        ](https://dl.fbaipublicfiles.com/searchformer/tokenSeqDB/maze.vocabulary.gz        ) | 7.2kB | Maze token meta data             |
| [sokoban.gz                ](https://dl.fbaipublicfiles.com/searchformer/tokenSeqDB/sokoban.gz                ) | 1.4GB | Sokoban token training data      |
| [sokoban.vocabulary.gz     ](https://dl.fbaipublicfiles.com/searchformer/tokenSeqDB/sokoban.vocabulary.gz     ) | 312B  | Sokoban token meta data          |
| [searchformer.gz           ](https://dl.fbaipublicfiles.com/searchformer/tokenSeqDB/searchformer.gz           ) | 8.6GB | Searchformer token training data |
| [searchformer.vocabulary.gz](https://dl.fbaipublicfiles.com/searchformer/tokenSeqDB/searchformer.vocabulary.gz) | 1.9MB | Searchformer token meta data     |

**To properly import and index the dataset, both the token training data itself and the meta data must be imported into MongoDB.**
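The download step can be scripted. The following is a minimal sketch that builds the two archive URLs for a dataset family from the base URL used in the table and fetches them with the standard library. The helper names `archive_urls` and `download` are hypothetical and not part of the `searchformer` package.

```python
import os
import urllib.request
from typing import List

# Base URL taken from the links in the table above.
BASE_URL = "https://dl.fbaipublicfiles.com/searchformer/tokenSeqDB"


def archive_urls(family: str) -> List[str]:
    """Return the vocabulary and training-data archive URLs for one family.

    `family` is one of "maze", "sokoban", or "searchformer".
    """
    return [
        f"{BASE_URL}/{family}.vocabulary.gz",
        f"{BASE_URL}/{family}.gz",
    ]


def download(family: str, dest_dir: str = ".") -> List[str]:
    """Download both archives into `dest_dir` and return the local paths."""
    paths = []
    for url in archive_urls(family):
        path = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, path)  # may take a long time for large files
        paths.append(path)
    return paths
```

Calling `download("sokoban")` would fetch `sokoban.vocabulary.gz` and `sokoban.gz` into the current directory. Note that `maze.gz` is 25GB, so make sure enough disk space is available before downloading it.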

Data can be imported from these files into a live MongoDB instance with [`mongorestore`](https://www.mongodb.com/docs/database-tools/mongorestore/).
For example, to import the maze datasets into a MongoDB instance running on localhost on the default port, run

```
mongorestore --gzip --archive=maze.vocabulary.gz
mongorestore --gzip --archive=maze.gz 
```
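To import several dataset families in one go, the same two commands can be generated and run from Python. This is a sketch only: the helper names `restore_commands` and `restore` are hypothetical, the archives are assumed to be in the current directory, and `mongorestore` is assumed to be on the `PATH`. Only the flags shown in the commands above are used.

```python
import subprocess
from typing import List


def restore_commands(family: str) -> List[List[str]]:
    """Build the two mongorestore invocations for one dataset family."""
    return [
        ["mongorestore", "--gzip", f"--archive={family}.vocabulary.gz"],
        ["mongorestore", "--gzip", f"--archive={family}.gz"],
    ]


def restore(family: str, dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in restore_commands(family):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if mongorestore fails.
            subprocess.run(cmd, check=True)


# Dry run: only prints the commands that would be executed.
for family in ("maze", "sokoban", "searchformer"):
    restore(family)
```

Pass `dry_run=False` to actually run the imports.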

Once all imports have completed, the imported datasets can be listed by running

```
python -m searchformer.trace list-token-datasets
```
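As an additional sanity check, the imported collections can be inspected directly with `pymongo`. The sketch below rests on two assumptions that are not confirmed by this document: that the archives restore into a database named `tokenSeqDB` (the name that appears in the download URLs), and that collection names start with the dataset name. The helper names are hypothetical.

```python
from typing import Iterable, List


def collections_with_prefix(names: Iterable[str], prefix: str) -> List[str]:
    """Return the sorted collection names that start with `prefix` plus a dot."""
    return sorted(n for n in names if n.startswith(prefix + "."))


def list_collections(
    uri: str = "mongodb://localhost:27017",
    db_name: str = "tokenSeqDB",  # assumption: database name used by the archives
) -> List[str]:
    """Return all collection names in the given database."""
    from pymongo import MongoClient  # imported lazily; requires pymongo

    client = MongoClient(uri)
    try:
        return client[db_name].list_collection_names()
    finally:
        client.close()
```

For example, `collections_with_prefix(list_collections(), "maze")` would return the collections that belong to the maze datasets, provided the assumptions above hold.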


The following datasets are included in the files `maze.gz` and `maze.vocabulary.gz`.

| Experiment                   | Dataset Name                                  |
| ---                          | ---                                           |
| 10x10 Maze, deterministic    | `maze.10-by-10-deterministic.simple`          |
| 20x20 Maze, deterministic    | `maze.20-by-20-deterministic.simple`          |
| 30x30 Maze, deterministic    | `maze.30-by-30-deterministic.simple`          |
| 10x10 Maze, nondeterministic | `maze.10-by-10-nondeterministic.simple`       |
| 20x20 Maze, nondeterministic | `maze.20-by-20-nondeterministic.simple`       |
| 30x30 Maze, nondeterministic | `maze.30-by-30-nondeterministic.simple`       |
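The maze dataset names in this table follow a regular pattern, so they can be generated instead of copied. The helper `maze_dataset_name` below is a hypothetical convenience function that only reproduces the six names listed above.

```python
def maze_dataset_name(size: int, deterministic: bool = True) -> str:
    """Return the dataset name for a size-by-size maze experiment."""
    if size not in (10, 20, 30):
        raise ValueError(f"no maze dataset listed for size {size}")
    mode = "deterministic" if deterministic else "nondeterministic"
    return f"maze.{size}-by-{size}-{mode}.simple"


# Print all six dataset names from the table.
for size in (10, 20, 30):
    for deterministic in (True, False):
        print(maze_dataset_name(size, deterministic))
```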


The following datasets are included in the files `sokoban.gz` and `sokoban.vocabulary.gz`.

| Experiment                   | Dataset Name                                  |
| ---                          | ---                                           |
| Sokoban                      | `sokoban.7-by-7-walls-2-boxes-2.with-box-40k` |


The following datasets are included in the files `searchformer.gz` and `searchformer.vocabulary.gz`.

| Model        | Step | Repeat | Checkpoint Name                         | Dataset Name                        |
| ---          | ---  | ---    | ---                                     | ---                                 |
| Searchformer | 1    | 0      | sokoban-7722-m-trace-plan-100k-0-step-1 | `65b8382b9ee4fbaa76e005b7.improved` |
| Searchformer | 1    | 1      | sokoban-7722-m-trace-plan-100k-1-step-1 | `65b8398ff7574141c3ba77ae.improved` |
| Searchformer | 1    | 2      | sokoban-7722-m-trace-plan-100k-2-step-1 | `65b8495e2382373d6a21ca99.improved` |
| Searchformer | 2    | 0      | sokoban-7722-m-trace-plan-100k-0-step-2 | `65ba856d986d307d60c563ca.improved` |
| Searchformer | 2    | 1      | sokoban-7722-m-trace-plan-100k-1-step-2 | `65ba8e2678ad82e62025d0c6.improved` |
| Searchformer | 2    | 2      | sokoban-7722-m-trace-plan-100k-2-step-2 | `65ba8ee773586ec314e0da96.improved` |
| Searchformer | 3    | 0      | sokoban-7722-m-trace-plan-100k-0-step-3 | `65c8b912c3dd164d1eb691a4.improved` |
| Searchformer | 3    | 1      | sokoban-7722-m-trace-plan-100k-1-step-3 | `65ca34d2e25a422484c5d3da.improved` |
| Searchformer | 3    | 2      | sokoban-7722-m-trace-plan-100k-2-step-3 | `65ca57d67f455f390d05bf33.improved` |
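Because the Searchformer dataset names are opaque identifiers, it can help to keep the table in a lookup structure. The sketch below copies the table into a dictionary keyed by `(step, repeat)`. The names `SEARCHFORMER_DATASETS`, `checkpoint_name`, and `dataset_name` are hypothetical helpers, not part of the `searchformer` package.

```python
from typing import Dict, Tuple

# (step, repeat) -> dataset name, copied from the table above.
SEARCHFORMER_DATASETS: Dict[Tuple[int, int], str] = {
    (1, 0): "65b8382b9ee4fbaa76e005b7.improved",
    (1, 1): "65b8398ff7574141c3ba77ae.improved",
    (1, 2): "65b8495e2382373d6a21ca99.improved",
    (2, 0): "65ba856d986d307d60c563ca.improved",
    (2, 1): "65ba8e2678ad82e62025d0c6.improved",
    (2, 2): "65ba8ee773586ec314e0da96.improved",
    (3, 0): "65c8b912c3dd164d1eb691a4.improved",
    (3, 1): "65ca34d2e25a422484c5d3da.improved",
    (3, 2): "65ca57d67f455f390d05bf33.improved",
}


def checkpoint_name(step: int, repeat: int) -> str:
    """Return the checkpoint name listed in the table for a step and repeat."""
    return f"sokoban-7722-m-trace-plan-100k-{repeat}-step-{step}"


def dataset_name(step: int, repeat: int) -> str:
    """Look up the dataset name; raises KeyError for unlisted combinations."""
    return SEARCHFORMER_DATASETS[(step, repeat)]
```

With this in place, `dataset_name(2, 1)` returns the name that belongs to the checkpoint `checkpoint_name(2, 1)`.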

## Loading a token dataset

To load a token dataset, instantiate the class `searchformer.trace.TokenizedDataset` and pass the dataset name to its constructor.
The resulting object is then used to access the token sequences.

```
import sys; sys.path.append("..")  # make the parent directory importable

import logging
from searchformer.trace import TokenizedDataset


# Only show warnings and errors.
logging.basicConfig(
    level=logging.WARNING,
    format="%(levelname)s - %(asctime)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)


# Load the dataset by name and report its size.
tok_dataset = TokenizedDataset("maze.10-by-10-deterministic.simple")
print(f"Number of train sequences: {len(tok_dataset.train_ids)}")
print(f"Number of test sequences:  {len(tok_dataset.test_ids)}")
print(f"Vocabulary size: {len(tok_dataset.vocabulary)}")
```


Number of train sequences: 1000000
Number of test sequences:  100000
Vocabulary size: 116


A tokenized sequence is represented by the data class `searchformer.trace.TokenizedTrace`.
Objects of this class are returned by the training and test iterators.

```
# Fetch the first trace returned by the training iterator.
tok_trace = next(iter(tok_dataset.train_it(tok_dataset.train_ids)))[0]

# Insert line breaks before each keyword so the token sequences are easier to read.
prompt_str = " ".join(tok_trace.prompt).replace("start", "\n\tstart").replace("goal", "\n\tgoal ").replace("wall", "\n\twall ")
execution_trace_str = " ".join(tok_trace.reasoning).replace("create", "\n\tcreate").replace("close", "\n\tclose ")
plan_str = " ".join(tok_trace.plan).replace("plan", "\n\tplan")
print(f"Prompt:          {prompt_str}")
print(f"Execution trace: {execution_trace_str}")
print(f"Plan:            {plan_str}")
```


{"_id": -8356109747986356151, "prompt": ["start", "1", "3", "goal", "9", "4", "wall", "0", "0", "wall", "3", "0", "wall", "4", "0", "wall", "6", "0", "wall", "7", "0", "wall", "6", "1", "wall", "8", "1", "wall", "1", "2", "wall", "3", "2", "wall", "6", "2", "wall", "7", "2", "wall", "2", "3", "wall", "4", "3", "wall", "5", "3", "wall", "6", "3", "wall", "7", "3", "wall", "7", "4", "wall", "8", "4", "wall", "1", "5", "wall", "2", "5", "wall", "4", "5", "wall", "5", "6", "wall", "6", "6", "wall", "0", "7", "wall", "2", "7", "wall", "5", "7", "wall", "8", "8", "wall", "1", "9", "wall", "7", "9", "wall", "8", "9", "wall", "9", "9"], "reasoning": ["create", "1", "3", "c0", "c9", "close", "1", "3", "c0", "c9", "create", "1", "4", "c1", "c8", "create", "0", "3", "c1", "c10", "close", "1", "4", "c1", "c8", "create", "0", "4", "c2", "c9", "create", "2", "4", "c2", "c7", "close", "2", "4", "c2", "c7", "create", "3", "4", "c3", "c6", "close", "3", "4", "c3", "c6", "create", "3", "5", "c4", "c7", 
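The flat token lists in the record above can be regrouped into structured values. The following sketch assumes, based only on the tokens visible in that record, that the prompt consists of `(keyword, x, y)` triples and that the execution trace consists of five-token steps of the form `(operation, x, y, c<int>, c<int>)`. The two `c`-prefixed values are parsed as integers without interpreting their meaning. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_prompt(tokens: List[str]) -> Dict[str, object]:
    """Group prompt tokens into start, goal, and wall coordinates."""
    if len(tokens) % 3 != 0:
        raise ValueError("expected (keyword, x, y) triples")
    parsed: Dict[str, object] = {"start": None, "goal": None, "walls": []}
    for i in range(0, len(tokens), 3):
        kind, x, y = tokens[i], int(tokens[i + 1]), int(tokens[i + 2])
        if kind == "wall":
            parsed["walls"].append((x, y))
        elif kind in ("start", "goal"):
            parsed[kind] = (x, y)
        else:
            raise ValueError(f"unexpected keyword {kind!r}")
    return parsed


def parse_reasoning(tokens: List[str]) -> List[Tuple[str, int, int, int, int]]:
    """Group execution-trace tokens into (operation, x, y, c1, c2) steps."""
    if len(tokens) % 5 != 0:
        raise ValueError("expected five tokens per step")
    steps = []
    for i in range(0, len(tokens), 5):
        op, x, y, c1, c2 = tokens[i:i + 5]
        # Strip the leading "c" from the last two tokens before converting.
        steps.append((op, int(x), int(y), int(c1[1:]), int(c2[1:])))
    return steps


# The first tokens of the record shown above.
prompt = ["start", "1", "3", "goal", "9", "4", "wall", "0", "0", "wall", "3", "0"]
reasoning = ["create", "1", "3", "c0", "c9", "close", "1", "3", "c0", "c9"]
print(parse_prompt(prompt))
print(parse_reasoning(reasoning))
```

The same functions can be applied to `tok_trace.prompt` and `tok_trace.reasoning`, as long as the sequences follow the layout assumed here.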
