LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents

Paper (ICLR 2024) | Project Page

Jae-Woo Choi¹*, Youngwoo Yoon¹*, Hyobin Ong¹,², Jaehong Kim¹, Minsu Jang¹,² (*equal contribution)

¹ Electronics and Telecommunications Research Institute, ² University of Science and Technology

We introduce a system for automatically quantifying the performance of task planning for home-service agents. Task planners are tested on two dataset-simulator pairs: 1) ALFRED and AI2-THOR, and 2) an extension of Watch-And-Help and VirtualHome. Using the proposed benchmark system, we perform extensive experiments with LLMs and prompts, and explore several extensions of the baseline planner.

Environment

Ubuntu 14.04+ is required. The scripts were developed and tested on Ubuntu 22.04 and Python 3.8.

You can use WSL-Ubuntu on Windows 10/11.

Install

Clone the whole repo.
```
$ git clone {repo_url}
```

Set up a virtual environment.

$ conda create -n {env_name} python=3.8
$ conda activate {env_name}

Install PyTorch (2.0.0) first (see https://pytorch.org/get-started/locally/).

# example install command for PyTorch 2.0.0 with CUDA 11.7
$ pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 --index-url https://download.pytorch.org/whl/cu117
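
To confirm that the install picked up the intended build, a short check along these lines may help. It is a minimal sketch: the function name `describe_torch_install` is hypothetical and not part of this repository, and it only reads attributes that PyTorch itself exposes.

```python
import importlib.util


def describe_torch_install():
    """Return a small report on the installed PyTorch, or None if it is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the check never raises ImportError

    return {
        "version": torch.__version__,       # version string of the installed build
        "cuda_build": torch.version.cuda,   # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    report = describe_torch_install()
    print(report if report is not None else "PyTorch is not installed in this environment")
```

If `cuda_available` is false on a machine with a GPU, the installed wheel and the local driver are the first things to compare.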

Install the Python packages listed in requirements.txt.
```
$ pip install -r requirements.txt
```

Benchmarking on ALFRED

Download the ALFRED dataset.

$ cd alfred/data
$ sh download_data.sh json

Benchmarking

$ python src/evaluate.py --config-name=config_alfred

You can override the configuration. We used Hydra for configuration management.

$ python src/evaluate.py --config-name=config_alfred planner.model_name=EleutherAI/gpt-neo-125M
$ python src/evaluate.py --config-name=config_alfred alfred.x_display='1'
$ python src/evaluate.py --config-name=config_alfred alfred.eval_portion_in_percent=100 prompt.num_examples=18
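
Each `key.subkey=value` argument replaces one entry in the nested configuration. The sketch below illustrates that mapping with plain dictionaries. It is not Hydra's implementation; the helper names and the placeholder defaults are assumptions made for illustration, and only the key names are taken from the commands in this README.

```python
import copy


def parse_value(raw):
    """Interpret a raw override value as a quoted string, bool, int, float, or plain string."""
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "'\"":
        return raw[1:-1]
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config, overrides):
    """Apply 'dotted.key=value' strings to a nested dict and return a modified copy."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return result


if __name__ == "__main__":
    # Placeholder defaults, NOT the values shipped in the repository's config files.
    defaults = {
        "planner": {"model_name": "placeholder-model"},
        "prompt": {"num_examples": 0},
        "alfred": {"eval_portion_in_percent": 0},
    }
    print(apply_overrides(defaults, [
        "planner.model_name=EleutherAI/gpt-neo-125M",
        "alfred.eval_portion_in_percent=100",
        "prompt.num_examples=18",
    ]))
```

Keys that are not overridden keep the value from the config file selected with `--config-name`.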

Headless Server

On headless servers, run the startx.py script before running ALFRED experiments. The command below uses 1 as the X display id, but you can use a different id such as 0.

$ sudo python3 alfred/scripts/startx.py 1

Benchmarking on Watch-And-Help

Download the VirtualHome Simulator

Download the VirtualHome simulator v2.2.2 and extract it

$ cd {project_root}/virtualhome/simulation/unity_simulator/
$ wget http://virtual-home.org//release/simulator/v2.0/v2.2.2/linux_exec.zip
$ unzip linux_exec.zip

Benchmarking on Watch-And-Help-NL

Open a new terminal and run the VirtualHome simulator.

$ cd {project_root}
$ ./virtualhome/simulation/unity_simulator/linux_exec.x86_64
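
The simulator needs a few seconds to start before the evaluation can connect to it. A small wait loop like the following can confirm that it is accepting connections. This is a sketch under assumptions: the helper `wait_for_port` is hypothetical, and port 8080 is assumed to be the simulator's default, so adjust it if your setup differs.

```python
import socket
import time


def wait_for_port(host="localhost", port=8080, timeout_s=30.0, poll_s=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)


if __name__ == "__main__":
    ready = wait_for_port()
    print("simulator port is open" if ready else "no listener found before the timeout")
```

An open port only shows that a process is listening; it does not verify that the process is the VirtualHome simulator.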

Open another terminal and evaluate.

$ cd {project_root}
$ python src/evaluate.py --config-name=config_wah

You can override the configuration. We used Hydra for configuration management.

$ cd {project_root}
$ python src/evaluate.py --config-name=config_wah planner.model_name=EleutherAI/gpt-neo-1.3B prompt.num_examples=10

Benchmarking on Watch-And-Help-NL Using Headless PC

Open a new terminal and run the X server.

$ cd {project_root}/virtualhome
$ sudo python helper_scripts/startx.py $display_num

Open another terminal and run the Unity simulator.

$ cd {project_root}/virtualhome
$ DISPLAY=:$display_num ./simulation/unity_simulator/linux_exec.x86_64 -batchmode

Open another terminal and evaluate

$ cd {project_root}
$ python src/evaluate.py --config-name=config_wah_headless

Extensions

In-context example selection

$ python src/evaluate.py --config-name=config_wah prompt.select_method=same_task
$ python src/evaluate.py --config-name=config_wah prompt.select_method=topk
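
As a rough illustration of the idea behind a similarity-based `topk` choice, the sketch below ranks candidate in-context examples by how similar their instructions are to the query instruction and keeps the best `k`. Everything in it is an assumption for illustration: the bag-of-words cosine similarity is a stand-in for whatever measure the repository uses, and the function names and sample instructions are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between lowercase bag-of-words vectors of two strings."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_top_k(query, candidates, k):
    """Return the k candidate instructions most similar to the query, best first."""
    ranked = sorted(candidates, key=lambda c: cosine_similarity(query, c), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up instructions, not taken from ALFRED or WAH-NL.
    pool = [
        "put the milk in the fridge",
        "turn on the tv",
        "put the apple on the table",
    ]
    print(select_top_k("put the apple in the fridge", pool, k=2))
```

Picking examples that resemble the query gives the planner demonstrations closer to the task at hand than a fixed or random set would.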

Replanning

$ python src/evaluate.py --config-name=config_alfred planner.use_predefined_prompt=True

Extract train samples from ALFRED for language model finetuning

Make sure you have preprocessed data (run ALFRED benchmarking at least once).

$ python src/misc/extract_alfred_train_samples.py

WAH-NL Dataset

You can find the WAH-NL data, which is our extension of WAH, in the ./dataset folder.

FAQ

Running out of disk space for Huggingface models
- You can set the cache folder to be on another disk.
```
$ export TRANSFORMERS_CACHE=/mnt/otherdisk/.hf_cache/
```
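
The same variable can also be set from Python, provided it happens before the `transformers` package is imported. The helper below is a minimal sketch with a hypothetical name (`use_hf_cache`); it only creates the folder and sets the environment variable named above.

```python
import os
from pathlib import Path


def use_hf_cache(cache_dir):
    """Create cache_dir if needed and point TRANSFORMERS_CACHE at it."""
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["TRANSFORMERS_CACHE"] = str(path)
    return path


# Usage (call this before any `import transformers`):
#   use_hf_cache("/mnt/otherdisk/.hf_cache/")
```
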
I encountered 'cannot find X server with xdpyinfo' when running ALFRED experiments.
- Please try another x_display number in the config file (it should be a string, e.g., '1').
```
$ python src/evaluate.py --config-name=config_alfred alfred.x_display='1'
```
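
To find out which display ids are usable before editing the config, you can probe them with `xdpyinfo` directly. The sketch below wraps that probe; the function name `x_display_is_reachable` is hypothetical, and it assumes the `xdpyinfo` tool is installed and on the PATH.

```python
import shutil
import subprocess


def x_display_is_reachable(display):
    """Return True if xdpyinfo can connect to the given display, e.g. ':1'."""
    if shutil.which("xdpyinfo") is None:
        return False  # the tool itself is missing, so nothing can be probed
    try:
        result = subprocess.run(
            ["xdpyinfo", "-display", display],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    for display_id in ("0", "1", "2"):
        state = "reachable" if x_display_is_reachable(f":{display_id}") else "not reachable"
        print(f"display :{display_id} is {state}")
```

A reachable id is the value to pass as `alfred.x_display`.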

Citation

@inproceedings{choi2024lota,
  title={LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents},
  author={Choi, Jae-Woo and Yoon, Youngwoo and Ong, Hyobin and Kim, Jaehong and Jang, Minsu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
