OpenHotels is a large-scale hotel image retrieval benchmark built from hotel-room imagery and associated hotel metadata. The task is hotel-scale retrieval: given a query image, retrieve the matching hotel from a large gallery containing both matching classes and many distractor hotel classes.
This repository will contain the code used to reproduce the OpenHotels experiments. The dataset itself is hosted separately on Hugging Face.
Full dataset:
Representative sample for inspection:
The full release uses tar-sharded image files and JSON metadata. Each image metadata row includes:
- path: image member name inside the tar shard.
- shard: relative path to the tar shard containing the image.
- hotel_id: anonymized hotel class identifier.
- room: room identifier associated with the upload when available.
- timestamp: upload timestamp.
Gallery rows also include is_object plus either view_type for non-object room views or object_type for object-centric images. Test Non-Object rows include view_type; Test Object rows include object_type.
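As a minimal sketch of how these fields fit together, the snippet below reads one image out of its tar shard and labels a row as a room view or an object image. It assumes each row has already been parsed into a Python dict with the field names listed above; the function names `read_image_bytes` and `describe_row` are hypothetical helpers and not part of the repository.

```python
import tarfile
from pathlib import Path


def read_image_bytes(row: dict, data_dir: str) -> bytes:
    """Return the raw bytes of the image described by one metadata row.

    Assumes row["shard"] is relative to data_dir and row["path"] is the
    member name inside that shard, as described in the field list.
    """
    shard_path = Path(data_dir) / row["shard"]
    with tarfile.open(shard_path, "r") as tar:
        member = tar.extractfile(row["path"])  # raises KeyError if missing
        if member is None:
            raise ValueError(f"{row['path']} is not a regular file")
        return member.read()


def describe_row(row: dict) -> str:
    """Label a row using object_type or view_type, whichever is present."""
    if row.get("object_type") is not None:
        return f"object:{row['object_type']}"
    if row.get("view_type") is not None:
        return f"view:{row['view_type']}"
    return "unlabeled"
```

Opening the shard once per image is slow for training loops, which is one reason to extract the shards to disk before running benchmarks.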
To set up the required environment, run:
conda create -n OpenHotels python=3.12 -c conda-forge -y
conda activate OpenHotels
pip install torch==2.8.0 torchvision==0.23.0 --extra-index-url https://download.pytorch.org/whl/cu129
pip install transformers==4.56.2 torchmetrics==1.8.2 numpy==1.26.4 tqdm pandas==2.3.2 faiss-cpu==1.13.2 sympy==1.13.3 huggingface_hub==0.34.4 requests==2.32.5 pytorch_lightning==2.6.1 pytorch_metric_learning==2.9.0
For VPR training and evaluation, we forked serizba/salad, keeping the original codebase intact while adding support for the OpenHotels dataset and introducing our multi-vector SALAD aggregation approach. Clone the fork and switch to the OpenHotels branch:
git clone -b OpenHotels https://github.com/GWUvision/salad.git
Install the Hugging Face Hub client:
pip install huggingface_hub
Download the representative sample:
python scripts/download/download_openhotels.py --dataset sample --output-dir data
Download the full dataset:
python scripts/download/download_openhotels.py --dataset full --output-dir data
Download only metadata:
python scripts/download/download_openhotels.py --dataset full --metadata-only --output-dir data
The downloaded folder keeps the Hugging Face release structure, including metadata_*.json files and tar shards under shards/.
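A quick way to sanity-check a download is to list what landed on disk. The sketch below is an illustration only: it assumes the metadata_*.json files sit directly in the dataset folder and that shard files end in .tar, neither of which is confirmed here, and `summarize_release` is a hypothetical helper rather than a script shipped with the repository.

```python
from pathlib import Path


def summarize_release(root: str) -> dict:
    """List metadata files and count tar shards in a downloaded release."""
    root_dir = Path(root)
    # Assumption: metadata files live at the top level of the release folder.
    metadata_files = sorted(p.name for p in root_dir.glob("metadata_*.json"))
    # Assumption: shards use a .tar extension somewhere under shards/.
    num_shards = sum(1 for _ in (root_dir / "shards").rglob("*.tar"))
    return {"metadata": metadata_files, "num_shards": num_shards}
```

For example, `summarize_release("data/full")` should report zero shards after a metadata-only download and a non-zero count after a full one, if the layout assumptions hold.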
After downloading the dataset, you must extract the tar shards into flat image directories for faster data loading during benchmarks. You can do this by running:
python scripts/extract_shards.py --data-dir data/full
All model checkpoints can be found in our Hugging Face collection: OpenHotels Models Collection
- imagingforgood/clip-vit-base-patch32-OpenHotels
- imagingforgood/dinov2-base-OpenHotels
- imagingforgood/vit-base-patch16-224-OpenHotels
- imagingforgood/siglip-base-patch16-224-OpenHotels

- imagingforgood/salad-OpenHotels
- imagingforgood/CosPlace-OpenHotels
- imagingforgood/GeM-OpenHotels
- imagingforgood/epshn-resnet50-Hotels50k
- imagingforgood/ConvAP-OpenHotels
- imagingforgood/MixVPR-OpenHotels
- imagingforgood/salad_multivector-OpenHotels
Users can run foundation model benchmarks using the following command:
python -m benchmark.run_benchmark --model vit --checkpoint imagingforgood/vit-base-patch16-224-OpenHotels --splits test_non_object
Users can run VPR model benchmarks using the following command:
python eval_OpenHotels.py --hf_repo imagingforgood/CosPlace-OpenHotels
We provide training code to fine-tune VPR models on OpenHotels. This is done through our fork of serizba/salad, which extends the original repository with:
- An OpenHotels dataloader adapted to our dataset format.
- A multi-vector aggregation approach built on top of the SALAD optimal transport framework.
The aggregation strategy and all other training hyperparameters are controlled via the model_config dict in salad/main.py (or by supplying a JSON override with --model_config_path). To start training, run:
cd salad/
python main.py
Or with a custom config:
cd salad/
python main.py --model_config_path path/to/model_config.json
Checkpoints are saved under logs/ after each epoch.
| Model | Room R@1 | Room R@5 | Room R@10 | Room R@100 | Object R@1 | Object R@5 | Object R@10 | Object R@100 |
|---|---|---|---|---|---|---|---|---|
| openai/clip-vit-base-patch32 | 11.80 | 17.09 | 19.70 | 31.29 | 5.82 | 8.78 | 10.09 | 15.55 |
| facebook/dinov2-base | 14.89 | 21.77 | 24.96 | 38.02 | 6.53 | 9.59 | 10.92 | 16.59 |
| google/vit-base-patch16-224 | 13.58 | 19.89 | 22.93 | 35.08 | 7.13 | 10.40 | 11.97 | 18.10 |
| google/siglip-base-patch16-224 | 15.44 | 22.04 | 25.19 | 39.01 | 9.40 | 13.39 | 15.11 | 22.29 |
Retrieval performance for frozen general-purpose foundation models and after LoRA fine-tuning. Additional zero-shot results can be found in the Appendix.
| Method | Descriptor Size | Room R@1 | Room R@5 | Room R@10 | Room R@100 | Object R@1 | Object R@5 | Object R@10 | Object R@100 |
|---|---|---|---|---|---|---|---|---|---|
| epshn_model (Baseline) | 256 | 22.51 | 32.35 | 36.61 | 51.88 | 5.32 | 8.11 | 9.54 | 15.99 |
| GeM | 1024 | 14.72 | 23.03 | 26.92 | 43.36 | 6.44 | 10.16 | 11.97 | 20.04 |
| MixVPR | 4096 | 18.47 | 27.16 | 31.31 | 48.23 | 9.37 | 13.78 | 15.76 | 23.53 |
| CosPlace | 2048 | 26.37 | 37.13 | 41.77 | 58.38 | 12.85 | 17.91 | 20.11 | 28.92 |
| ConvAP | 8192 | 26.50 | 37.19 | 41.76 | 58.29 | 12.10 | 16.68 | 18.72 | 26.38 |
| SALAD | 8448 (256+8192) | 31.60 | 42.64 | 47.20 | 62.59 | 14.41 | 19.50 | 21.73 | 29.80 |
| Multi-Vector SALAD | 8320 ((64+1)*128) | 34.11 | 45.24 | 49.58 | 64.32 | 15.64 | 20.83 | 23.19 | 31.64 |
Performance using DINOv2-ViTB14 as a backbone across various state-of-the-art visual place recognition pooling and aggregation strategies.
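For readers unfamiliar with the R@K columns, the sketch below shows one common way to compute hotel-level Recall@K: a query counts as a hit at K if any of its K nearest gallery images shares its hotel_id. This is an illustration under assumptions (cosine similarity, brute-force search in NumPy) and may differ from the evaluation code used to produce the tables; `recall_at_k` is a hypothetical name.

```python
import numpy as np


def recall_at_k(query_emb, query_ids, gallery_emb, gallery_ids, ks=(1, 5, 10, 100)):
    """Return {k: Recall@k in percent} for hotel-level retrieval."""
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    # Keep only as many neighbors as the largest K (or the gallery size).
    max_k = min(max(ks), g.shape[0])
    top = np.argsort(-sims, axis=1)[:, :max_k]
    # hits[i, j] is True when the j-th neighbor of query i is the same hotel.
    hits = gallery_ids[top] == query_ids[:, None]
    return {k: 100.0 * float(hits[:, :k].any(axis=1).mean()) for k in ks}
```

A full similarity matrix does not scale to a large gallery; an approximate or batched nearest-neighbor index such as FAISS, which is already in the environment above, is the usual replacement for the `argsort` step.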