Visualization-of-Thought (VoT) prompting is designed to enhance the spatial reasoning abilities of large language models (LLMs) by visualizing their reasoning traces, thus guiding subsequent reasoning steps. This approach leverages the concept of the "mind's eye" in human cognition, which refers to the ability to visualize and manipulate mental images. By emulating this cognitive process, VoT has been applied to tasks such as natural language navigation, visual navigation, and visual tiling in 2D grid worlds, significantly improving the performance of LLMs in these areas.

Before you begin, ensure you have met the following requirements:
- You have installed Python 3.x.
- You have installed Node.js (needed for generating the visual tiling data).
Clone the repository and install the required dependencies:

```shell
git clone https://github.com/microsoft/Visualization-of-Thought.git
cd Visualization-of-Thought
pip install -r src/requirements.txt
```

You need to download the dataset and then generate prompts for the different settings. The dataset includes tasks designed to evaluate the spatial reasoning capabilities of LLMs:
- Natural Language Navigation:
  - A square map defined by a sequence of random walk instructions and associated objects.
  - Task: identify the associated object at a specified location determined by navigation instructions.
  - Data generation is implemented in the SpatialEvalLLM repository. Use the following command to generate the data:

```shell
python square.py --seed 8 --size 4 --steps 8 --maptype square --label_path ./labels/imagenetsimple.json --n_sample 200 --out_dir results_map_global --special_order snake_order
```

- Visual Tasks:
Download the dataset of visual tasks via this link, and place it under the root folder of this repo.
```shell
mkdir -p dataset
unzip VoT-Dataset-Visual-Tasks.zip -d dataset
cd src
# fill in prompt templates for different settings
sh patch-prompt.sh ../dataset
```

Please note that the prompts of the different settings (CoT/VoT/GPT-4V CoT) for each instance have been removed from this released version. The patch-prompt.sh script automatically fills in the prompt templates for all experiment settings across tasks. Prompt templates are stored in the prompts folder under each visual task. For example:
- visual-navigation/route-planning/prompts/{setting}.txt
- visual-navigation/next-step-prediction/prompts/{setting}.txt
- visual-tiling/prompts/{setting}.txt
To create your own prompt template for a new setting, simply add the template file under the prompts folder of the relevant task. See the VoT template for an example.
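Conceptually, template filling substitutes per-instance content into the shared template text. The sketch below is only an illustration: the `{question}` placeholder name and the `fill_template` helper are hypothetical, not the repo's actual implementation (patch-prompt.sh defines the real behavior).

```python
from pathlib import Path

def fill_template(template_path: str, question: str) -> str:
    # "{question}" is a hypothetical placeholder; the released templates
    # define the real placeholder format.
    template = Path(template_path).read_text(encoding="utf-8")
    return template.replace("{question}", question)
```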
Sample code is provided for each task to run experiments. You need to implement the run_llm_client function for each visual task, which writes the response to the specified output path. Then run the following commands for evaluation:
```shell
python visual-navigation/route-planning/sample.py --jsonl-path ../dataset/visual-navigation/route-planning.jsonl --output-folder {output-folder} --setting {setting}
python visual-navigation/next-step-prediction/sample.py --jsonl-path ../dataset/visual-navigation/next-step-prediction.jsonl --output-folder {output-folder} --setting {setting}
python visual-tiling/sample.py --jsonl-path ../dataset/visual-tiling/visual-tiling.jsonl --output-folder {output-folder} --setting {setting}
```

The performance of the specified setting is printed to the terminal, along with the path of a log file that records all failing cases for debugging.
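The run_llm_client hook these scripts expect can be sketched roughly as below. The signature and file layout here are assumptions for illustration; check each task's sample.py for the interface it actually expects, and note that `my_model_call` is a stand-in for your own LLM API client.

```python
from pathlib import Path

def my_model_call(prompt: str) -> str:
    # Stand-in for a real LLM API call (e.g. an Azure OpenAI chat completion).
    return "Answer: A"

def run_llm_client(prompt: str, output_path: str) -> None:
    """Query the model with `prompt` and write the raw response to `output_path`."""
    response = my_model_call(prompt)
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(response, encoding="utf-8")
```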
Note that LLM-generated responses are parsed with regex patterns, and the default patterns implemented in the code target GPT-family models. You may need to specify different patterns for other models; pay attention to "failing to parse" cases in the log file. To use the regex patterns we implemented for LLaMA, or your own customized patterns, pass the --regex-path parameter when running the evaluation script.
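As an illustration of this kind of parsing (the pattern below is hypothetical, not the repo's actual default), a GPT-style extractor might look like:

```python
import re

# Illustrative pattern only; the evaluation scripts ship their own defaults,
# and --regex-path lets you substitute custom patterns.
ANSWER_PATTERN = re.compile(r"(?:final )?answer\s*[:=]\s*(.+)", re.IGNORECASE)

def parse_answer(response: str):
    """Return the extracted answer string, or None on a parse failure."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).strip() if match else None
```

Responses that return None here would show up as "failing to parse" cases in the log file.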
Each instance in the visual tasks dataset contains the following fields:

| Field Name | Description | Type | Example |
|---|---|---|---|
| desc | Text input for LLMs. | String | "" |
| desc_multimodal | Message array input for MLLMs, containing text and images, following the Azure OpenAI format. | Array of Messages | [{}] |
| answer | A single string, or a list of strings for route planning tasks. | String or String Array | "A" or ["left", "down"] |
| puzzle_path | Folder of each prompt instance. | String | "puzzles/level-2/103/Tetromino T" |
| config_path | Folder of images and original spatial configurations. | String | "configurations/level-2/103" |
| difficulty | Difficulty level of the question or puzzle. | Integer | 2 |
| instance_id | Relative path of the instance identifier within the puzzle folder. | String | "103/Tetromino T" |
Each instance in the natural language navigation dataset contains the following fields:

| Field Name | Description | Type | Example |
|---|---|---|---|
| question | The navigation question. | String | "" |
| answer | The name of the object to be found. | String | "Sofa" |
The dataset can be extended by specifying a new difficulty level; refer to the scripts that generate the visual task datasets.
- Visual Navigation:

```shell
# make sure to switch to the src folder
mkdir -p ../dataset
sh visual-navigation/gen-data.sh ../dataset/visual-navigation
```

Run the following commands to extend the dataset with a new difficulty level K:
```shell
python visual-navigation/gen_all_paths.py --turn {K} --dest-folder ../dataset/visual-navigation/configurations/level-{K}
python visual-navigation/route-planning/gen_puzzle.py --config-folder ../dataset/visual-navigation/configurations/level-{K} --puzzle-folder ../dataset/visual-navigation/route-planning/level-{K} --output-jsonl ../dataset/visual-navigation/route-planning.jsonl --difficulty {K}
python visual-navigation/next-step-prediction/gen_puzzle.py --config-folder ../dataset/visual-navigation/configurations/level-{K} --puzzle-folder ../dataset/visual-navigation/next-step-prediction/level-{K} --output-jsonl ../dataset/visual-navigation/next-step-prediction.jsonl --difficulty {K}
```

- Visual Tiling:
```shell
# make sure to switch to the src folder
mkdir -p ../dataset
# uncomment the `npm install` line to install the dependency node modules
sh visual-tiling/gen-data.sh ../dataset/visual-tiling
```

Run the following commands to extend the dataset with a new difficulty level K, rectangle size, and polyomino pieces. For example, a 5 × 4 rectangle can be filled by "TTLII" (2 T pieces, 1 L piece, and 2 I pieces).
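As a quick aside (not part of the repo's tooling): each piece letter denotes a tetromino covering 4 cells, so a chosen piece string can be sanity-checked against the rectangle area before generating.

```python
def pieces_cover_rectangle(pieces: str, width: int, height: int) -> bool:
    # Each polyomino piece (T, L, I, ...) is a tetromino covering 4 cells.
    cells_per_piece = 4
    return len(pieces) * cells_per_piece == width * height

# "TTLII" has 5 pieces * 4 cells = 20 cells, exactly a 5 x 4 rectangle.
```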
```shell
cd visual-tiling/gen-solution
node run.js --width=4 --height=5 --masked=K --dest=../dataset/visual-tiling/configurations/level-{K} --pieces='TTLII'
cd ..
python gen_puzzle.py --config-folder ../dataset/visual-tiling/configurations/level-{K} --puzzle-folder ../dataset/visual-tiling/puzzles/level-{K} --output-jsonl ../dataset/visual-tiling/visual-tiling.jsonl --difficulty {K}
```

This project is licensed under the MIT License. See the LICENSE file for details.
If you have any questions or suggestions, feel free to open an issue in the repository or contact the authors.
If you use this dataset, please cite us:
```bibtex
@misc{wu2024mindseyellmsvisualizationofthought,
      title={Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models},
      author={Wenshan Wu and Shaoguang Mao and Yadong Zhang and Yan Xia and Li Dong and Lei Cui and Furu Wei},
      year={2024},
      eprint={2404.03622},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2404.03622},
}
```