For all of the datasets, the captions are already included in this repo; only the images need to be downloaded.
The RSICD images can be found at https://github.com/201528014227051/RSICD_optimal. The expected folder structure is:
```
data
├── rsicd
│   ├── RSICD_images
│   │   ├── 00001.jpg
│   │   ├── 00002.jpg
│   │   └── ...
│   └── dataset_rsicd.json
```
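If the JSON follows the usual Karpathy-style layout (an `images` list whose entries carry `filename`, `split`, and `sentences`), the captions can be loaded with something like the sketch below; verify the keys against your copy of the file.

```python
# Minimal sketch for reading the captions, assuming dataset_rsicd.json
# follows the Karpathy-style layout ("images" entries with "filename",
# "split", and "sentences") -- check your copy of the file.
import json
from pathlib import Path

root = Path("data/rsicd")
with open(root / "dataset_rsicd.json") as f:
    dataset = json.load(f)

splits = {}
for img in dataset["images"]:
    splits.setdefault(img["split"], []).append({
        "path": root / "RSICD_images" / img["filename"],
        "captions": [s["raw"] for s in img["sentences"]],
    })

for name, items in splits.items():
    print(name, len(items))
```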
The UCM images can also be found at https://github.com/201528014227051/RSICD_optimal. The expected folder structure is:
```
data
├── ucm
│   ├── images
│   │   ├── 1.tif
│   │   ├── 2.tif
│   │   └── ...
│   └── dataset.json
```
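`dataset.json` presumably shares the same layout, so a quick sanity check that the downloaded images match the captions file could look like this (the `images`/`filename` keys are the same assumption as above):

```python
# Sanity check: every image referenced in dataset.json should exist on
# disk. The "images"/"filename" keys are an assumed Karpathy-style schema.
import json
from pathlib import Path

root = Path("data/ucm")
with open(root / "dataset.json") as f:
    dataset = json.load(f)

missing = [img["filename"] for img in dataset["images"]
           if not (root / "images" / img["filename"]).exists()]
print(f"{len(missing)} referenced images are missing")
```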
The Sydney images are also available at https://github.com/201528014227051/RSICD_optimal. The expected folder structure is:
```
data
├── sydney
│   ├── images
│   │   ├── 1.tif
│   │   ├── 2.tif
│   │   └── ...
│   └── filenames
│       ├── descriptions_SYDNEY.txt
│       ├── filenames_test.txt
│       ├── filenames_train.txt
│       └── filenames_val.txt
```
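The Sydney annotations come as plain text rather than JSON. Below is a sketch for reading the splits, assuming one filename per line in each `filenames_*.txt`; the layout of `descriptions_SYDNEY.txt` varies between distributions, so inspect it before writing a parser.

```python
# Read the Sydney split lists (assumed: one image filename per line).
from pathlib import Path

root = Path("data/sydney/filenames")

def read_split(name):
    with open(root / f"filenames_{name}.txt") as f:
        return [line.strip() for line in f if line.strip()]

train, val, test = (read_split(s) for s in ("train", "val", "test"))
print(len(train), len(val), len(test))

# Peek at the descriptions file to confirm its format before parsing;
# its exact layout is not documented here.
with open(root / "descriptions_SYDNEY.txt") as f:
    for _ in range(3):
        print(f.readline().rstrip())
```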
The NWPU-Captions data can be found at https://github.com/HaiyanHuang98/NWPU-Captions. The expected folder structure is:
```
data
├── nwpu
│   ├── images
│   │   ├── airplane
│   │   │   ├── airplane_001.jpg
│   │   │   └── ...
│   │   ├── bridge
│   │   │   ├── bridge_001.jpg
│   │   │   └── ...
│   │   └── ...
│   └── dataset_nwpu.json
```
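Since the NWPU images are grouped by class, indexing them means walking the subfolders. The schema of `dataset_nwpu.json` isn't documented here, so the sketch below just prints its top-level shape so you can adapt a loader:

```python
# Index the class-structured NWPU images and peek at the captions file.
import json
from pathlib import Path

root = Path("data/nwpu")
images_by_class = {d.name: sorted(d.glob("*.jpg"))
                   for d in (root / "images").iterdir() if d.is_dir()}
print({c: len(v) for c, v in sorted(images_by_class.items())[:3]})

with open(root / "dataset_nwpu.json") as f:
    captions = json.load(f)
# The top-level type/keys tell you how to iterate (schema assumed unknown).
print(type(captions).__name__,
      list(captions)[:3] if isinstance(captions, dict) else "list")
```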
To train using VGG16 as the backbone encoder, run:

```bash
python3 train_decoder.py --encoder=vgg
```
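For reference, a VGG16 captioning encoder typically keeps only the convolutional stack and drops the classifier head; here is a torchvision sketch of that common setup (not necessarily the exact architecture `train_decoder.py` builds):

```python
# Sketch of a VGG16 feature extractor for captioning: keep the conv
# stack, drop the classifier head. train_decoder.py may differ.
import torch
import torchvision

weights = torchvision.models.VGG16_Weights.IMAGENET1K_V1
encoder = torchvision.models.vgg16(weights=weights).features.eval()

with torch.no_grad():
    feats = encoder(torch.randn(1, 3, 224, 224))
print(feats.shape)  # (1, 512, 7, 7) spatial features for the decoder to attend over
```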
To train using the RemoteCLIP backbone, run:

```bash
python3 train_decoder.py --encoder=remote_clip
```
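RemoteCLIP weights are distributed as open_clip state dicts, so loading the backbone usually looks like the sketch below; the checkpoint path is a placeholder for wherever you downloaded the weights:

```python
# Sketch of loading a RemoteCLIP backbone through open_clip. The
# checkpoint path is a placeholder; download the weights separately.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")
ckpt = torch.load("checkpoints/RemoteCLIP-ViT-B-32.pt", map_location="cpu")
model.load_state_dict(ckpt)
model.eval()

with torch.no_grad():
    feats = model.encode_image(torch.randn(1, 3, 224, 224))
print(feats.shape)  # (1, 512) image embedding
```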
If you want to fine-tune CLIP on the SEG-4 dataset, run:

```bash
python3 train_clip.py
```
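`train_clip.py`'s internals aren't shown here, but CLIP fine-tuning generally means the symmetric contrastive loss over an image-caption batch. Below is a generic open_clip sketch with stand-in tensors in place of a real SEG-4 batch:

```python
# One generic CLIP contrastive fine-tuning step (stand-in batch, not
# the actual SEG-4 data pipeline).
import torch
import torch.nn.functional as F
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

images = torch.randn(4, 3, 224, 224)        # stand-in image batch
texts = tokenizer(["an aerial photo"] * 4)  # stand-in caption batch

img_f = F.normalize(model.encode_image(images), dim=-1)
txt_f = F.normalize(model.encode_text(texts), dim=-1)
logits = model.logit_scale.exp() * img_f @ txt_f.t()

# Symmetric cross-entropy: matching image/caption pairs sit on the diagonal.
labels = torch.arange(len(images))
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
loss.backward()
optimizer.step()
print(float(loss))
```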
Then you can train the decoder model:

```bash
python3 train_decoder.py --encoder=clip
```
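The three training commands differ only in the `--encoder` flag; here is a hypothetical sketch of how such a flag might be parsed and dispatched (`train_decoder.py`'s actual argument handling is not shown in this README):

```python
# Hypothetical --encoder dispatch; the choices mirror the CLI values above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--encoder", choices=["vgg", "remote_clip", "clip"],
                    default="vgg")
args = parser.parse_args()
print(f"building the {args.encoder} encoder")
```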
To evaluate the model on each dataset and print the metrics, run:

```bash
python3 evaluate.py
```
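BLEU is the standard metric on these benchmarks; here is a minimal sketch with NLTK, which may or may not be the package `evaluate.py` uses:

```python
# BLEU-4 on a toy example; evaluate.py may rely on a different metrics package.
from nltk.translate.bleu_score import corpus_bleu

# One list of reference captions per image, tokenized.
references = [[["a", "plane", "is", "parked", "on", "the", "runway"]]]
hypotheses = [["a", "plane", "is", "parked", "on", "a", "runway"]]
print("BLEU-4:", corpus_bleu(references, hypotheses))
```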
To generate captions for a specific image, run:

```bash
python3 caption.py --image_path path/to/image.jpg
```
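Whatever encoder is selected, the image first has to become the tensor the model expects. A typical pipeline for the 224×224 encoders above, assuming ImageNet normalization (not necessarily what `caption.py` applies), looks like this:

```python
# Typical preprocessing before captioning an image (assumed pipeline).
from PIL import Image
import torchvision.transforms as T

transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image = transform(Image.open("path/to/image.jpg").convert("RGB")).unsqueeze(0)
print(image.shape)  # (1, 3, 224, 224), ready for the encoder
```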