Open-Vocabulary Instance Segmentation and Re-identification Benchmark
Reliable detection, segmentation, and re-identification of objects is crucial for the development of effective 3D perception and object-level SLAM systems. Moreover, achieving this should not require millions of training examples. We created this benchmark to highlight this issue and to expedite research on single-shot and few-shot object instance segmentation and re-identification algorithms.
This work has been tested in a Python 3.9 environment.

- Install dependencies:
  ```
  pip install cython
  pip install -r requirements.txt
  ```
- Download the dataset (>70 GB): dataset release pending
- Download model weights:
  ```
  python3 ./isar/util/download_model_weights.py
  ```
- Run the benchmark:
  ```
  python3 benchmark.py (...)
  ```
CLI arguments:
- `-mc`, `--method_config`: path to the method config file
- `-d`, `--datadir`: path to the dataset directory
- `-o`, `--outdir`: path to the output directory
- `-dev`, `--device`: device to use. Choices: ["cpu", "cuda"]
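As a rough illustration, the arguments above could be wired up with Python's `argparse` as sketched below. The exact flag spellings and defaults are assumptions for this sketch, not the authoritative interface of `benchmark.py`.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the CLI described above; check
    # benchmark.py for the real flag names and defaults.
    parser = argparse.ArgumentParser(description="Run the ISAR benchmark.")
    parser.add_argument("-mc", "--method_config", type=str,
                        help="path to method config file")
    parser.add_argument("-d", "--datadir", type=str,
                        help="path to directory of dataset")
    parser.add_argument("-o", "--outdir", type=str,
                        help="path to output directory")
    parser.add_argument("-dev", "--device", choices=["cpu", "cuda"],
                        default="cpu", help="device to use")
    return parser

args = build_parser().parse_args(
    ["-d", "data/ISAR", "-o", "results", "-dev", "cuda"]
)
print(args.device)  # cuda
```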
The easiest way to test a new method on the dataset and receive results in the same format as the baseline method is:
- Create a new class inheriting from detector.GenericDetector
- Implement all of its member functions
- Replace the detector in benchmark.Benchmark with your own implementation
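A skeletal example of such a wrapper follows. The real member functions of `detector.GenericDetector` are defined in the repository and may differ; the stand-in base class and the method names used here (`train`, `detect`) are assumptions for illustration only.

```python
import numpy as np

class GenericDetector:
    """Stand-in for detector.GenericDetector; the real interface
    lives in the ISAR repository and may differ."""
    def train(self, train_scene_dir: str, prompts: dict) -> None:
        raise NotImplementedError

    def detect(self, image: np.ndarray) -> dict:
        raise NotImplementedError

class MyDetector(GenericDetector):
    """Toy implementation: returns an empty mask at the input
    resolution for every prompted object id."""
    def __init__(self):
        self.object_ids = []

    def train(self, train_scene_dir: str, prompts: dict) -> None:
        # A real method would learn object appearance from the
        # single-/few-shot prompts of the train scene.
        self.object_ids = list(prompts.keys())

    def detect(self, image: np.ndarray) -> dict:
        h, w = image.shape[:2]
        return {obj_id: np.zeros((h, w), dtype=bool)
                for obj_id in self.object_ids}

det = MyDetector()
det.train("Dataset_name/multi_object/task_name/train/scene_name", {"1": "chair"})
masks = det.detect(np.zeros((480, 640, 3), dtype=np.uint8))
print(sorted(masks))  # ['1']
```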
The dataset is structured as follows:
```
Dataset_name
|--multi_object
--|--task_name
----|--info.json (task info)
----|--train
------|--scene_name
--------|--attributes.json (scene attributes)
--------|--camera_poses.json (6DOF camera pose of each frame)
--------|--color_map.json (unique mapping: semantic_id->rgb_color of scene)
--------|--prompts_single.json (prompts for single-shot case)
--------|--prompts_multi.json (prompts for multi-shot case)
--------|--rgb
-----------|--xxxxxxx.jpg
-----------|-- ...
--------|--(optional: depth)
-----------|--xxxxxxx.png
-----------|-- ...
------|-- ...
----|--test
------|--scene_name
--------|--attributes.json (scene attributes)
--------|--camera_poses.json (6DOF camera pose of each frame)
--------|--color_map.json (unique mapping: semantic_id->rgb_color of scene)
--------|--rgb
-----------|--xxxxxxx.jpg
-----------|-- ...
--------|--(optional: depth)
-----------|--xxxxxxx.png
-----------|-- ...
--------|--semantic (used for visualization)
-----------|--xxxxxxx.png
-----------|-- ...
--------|--semantic_raw (used for evaluation)
-----------|--xxxxxxx.png
-----------|-- ...
------|-- ...
--|-- ...
```
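The layout above can be traversed with `pathlib`. The sketch below builds a miniature mock of the tree in a temporary directory and enumerates the RGB frames of each train scene; the directory names follow the listing above, while the frame filenames are made up for the example.

```python
import tempfile
from pathlib import Path

def list_rgb_frames(task_dir: Path, split: str) -> dict:
    """Map scene_name -> sorted list of RGB frame paths for one split."""
    frames = {}
    for scene in sorted((task_dir / split).iterdir()):
        if scene.is_dir():
            frames[scene.name] = sorted((scene / "rgb").glob("*.jpg"))
    return frames

# Build a miniature mock of the dataset tree described above.
root = Path(tempfile.mkdtemp())
rgb = root / "multi_object" / "task_name" / "train" / "scene_name" / "rgb"
rgb.mkdir(parents=True)
for i in range(3):
    (rgb / f"{i:07d}.jpg").touch()  # e.g. 0000000.jpg

frames = list_rgb_frames(root / "multi_object" / "task_name", "train")
print({k: len(v) for k, v in frames.items()})  # {'scene_name': 3}
```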
The ISAR benchmark dataset is a synthetic dataset built with the AI Habitat simulator using data from the Replica dataset, the Habitat-Matterport 3D dataset, and the YCB Object and Model Set. Furthermore, it uses the objects listed in ./isar/attribution/README.md .
The baseline method of ISAR builds on and utilizes previous works such as Segment Anything and DINOv2. Legacy versions of the method build on OW-DETR (which builds on Deformable DETR, Detreg, and OWOD) and CLIP.
When using the dataset in your research, please also cite: