Commit 647342c (parent 1026ed2), showing 4 changed files with 637 additions and 59 deletions.
README.md
@@ -1,72 +1,44 @@
-# M²: Meshed-Memory Transformer
-This repository contains the reference code for the paper _[M²: Meshed-Memory Transformer for Image Captioning](https://arxiv.org/abs/1912.08226)_.
+# Dual-Level Collaborative Transformer for Image Captioning
+This repository contains the reference code for the paper [Dual-Level Collaborative Transformer for Image Captioning](https://arxiv.org/pdf/2101.06462.pdf).

-<p align="center">
-  <img src="images/m2.png" alt="Meshed-Memory Transformer" width="320"/>
-</p>

-## Environment setup
-Clone the repository and create the `m2release` conda environment using the `environment.yml` file:
-```
-conda env create -f environment.yml
-conda activate m2release
-```

-Then download spacy data by executing the following command:
-```
-python -m spacy download en
-```

-Note: Python 3.6 is required to run our code.
+![](https://github.com/luo3300612/image-captioning-DLCT/raw/master/images/arch.png)

+## Experiment setup
+Please refer to [m2 transformer](https://github.com/aimagelab/meshed-memory-transformer).

 ## Data preparation
-To run the code, annotations and detection features for the COCO dataset are needed. Please download the annotations file [annotations.zip](https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing) and extract it.
+* **Annotation**. Download the annotation file [annotation.zip](https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing).
+* **Feature**. You can download our ResNeXt-101 features (.hdf5 file) [here](https://pan.baidu.com/s/188xmv2r5eXUbEUqKSA4BCw). Access code: etrx.

-Detection features are computed with the code provided by [1]. To reproduce our result, please download the COCO features file [coco_detections.hdf5](https://drive.google.com/open?id=1MV6dSnqViQfyvgyHrmAT_lLpFbkzp3mx) (~53.5 GB), in which detections of each image are stored under the `<image_id>_features` key. `<image_id>` is the id of each COCO image, without leading zeros (e.g. the `<image_id>` for `COCO_val2014_000000037209.jpg` is `37209`), and each value should be a `(N, 2048)` tensor, where `N` is the number of detections.
+There are five kinds of keys in our .hdf5 file:

+* `['%d_features' % image_id]`: region features, `(N_regions, feature_dim)`
+* `['%d_boxes' % image_id]`: bounding boxes of the region features, `(N_regions, 4)`
+* `['%d_size' % image_id]`: size of the original image (for normalizing bounding boxes), `(2,)`
+* `['%d_grids' % image_id]`: grid features, `(N_grids, feature_dim)`
+* `['%d_mask' % image_id]`: geometric alignment graph, `(N_regions, N_grids)`

-## Evaluation
-To reproduce the results reported in our paper, download the pretrained model file [meshed_memory_transformer.pth](https://drive.google.com/file/d/1naUSnVqXSMIdoiNz_fjqKwh9tXF7x8Nx/view?usp=sharing) and place it in the code folder.
+We extract features with the code in [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa).

-Run `python test.py` using the following arguments:
+The first three keys can be obtained when extracting region features with [extract_region_feature.py](./others/extract_region_feature.py).
+The fourth key can be obtained when extracting grid features with the code in [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa).
+The last key can be obtained with [align.ipynb](./align/align.ipynb).
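
For a quick sanity check of a prepared file, here is a minimal sketch (using the `./data/coco_all_align.hdf5` path from the evaluation command below; the image id is only an example) that prints the shape of each of the five arrays stored for one image:

```python
import h5py

with h5py.File('./data/coco_all_align.hdf5', 'r') as f:
    image_id = 37209  # any COCO image id present in the file
    for suffix in ['features', 'boxes', 'size', 'grids', 'mask']:
        key = '%d_%s' % (image_id, suffix)
        print(key, f[key].shape)
```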

-| Argument | Possible values |
-|------|------|
-| `--batch_size` | Batch size (default: 10) |
-| `--workers` | Number of workers (default: 0) |
-| `--features_path` | Path to detection features file |
-| `--annotation_folder` | Path to folder with COCO annotations |
+## Training

-#### Expected output
-Under `output_logs/`, you may also find the expected output of the evaluation code.

-## Training procedure
-Run `python train.py` using the following arguments:

-| Argument | Possible values |
-|------|------|
-| `--exp_name` | Experiment name |
-| `--batch_size` | Batch size (default: 10) |
-| `--workers` | Number of workers (default: 0) |
-| `--m` | Number of memory vectors (default: 40) |
-| `--head` | Number of heads (default: 8) |
-| `--warmup` | Warmup value for learning rate scheduling (default: 10000) |
-| `--resume_last` | If used, training will be resumed from the last checkpoint. |
-| `--resume_best` | If used, training will be resumed from the best checkpoint. |
-| `--features_path` | Path to detection features file |
-| `--annotation_folder` | Path to folder with COCO annotations |
-| `--logs_folder` | Path to folder for tensorboard logs (default: "tensorboard_logs") |

-For example, to train our model with the parameters used in our experiments, use
-```
-python train.py --exp_name m2_transformer --batch_size 50 --m 40 --head 8 --warmup 10000 --features_path /path/to/features --annotation_folder /path/to/annotations
-```
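
The new Training section above is empty in this commit. For orientation only, here is a hypothetical DLCT invocation assembled from the old defaults above and the evaluation flags below (argument names are assumptions, not confirmed by this commit):

```
python train.py --exp_name DLCT --batch_size 50 --workers 4 --features_path ./data/coco_all_align.hdf5 --annotation annotation --model DLCT --image_field ImageAllFieldWithMask --grid_embed --box_embed
```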
+## Evaluation
+```python
+python eval.py --annotation annotation --workers 4 --features_path ./data/coco_all_align.hdf5 --model_path path_of_model_to_eval.pth --model DLCT --image_field ImageAllFieldWithMask --grid_embed --box_embed --dump_json gen_res.json --beam_size 5
+```
+Important args:
+* `--features_path`: path to the .hdf5 file
+* `--model_path`: path to the model checkpoint to evaluate
+* `--dump_json`: file to dump the generated captions to
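
To inspect the captions dumped via `--dump_json`, a minimal sketch (the JSON schema is not shown in this commit, so this only peeks at the structure):

```python
import json

with open('gen_res.json') as f:
    results = json.load(f)

print(type(results).__name__, len(results))
# Print a few entries to learn the actual record format before processing it.
sample = results[:3] if isinstance(results, list) else list(results.items())[:3]
for entry in sample:
    print(entry)
```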

-<p align="center">
-  <img src="images/results.png" alt="Sample Results" width="850"/>
-</p>
+## References
+[1] [M2](https://github.com/aimagelab/meshed-memory-transformer)

-#### References
-[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In _Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition_, 2018.
+[2] [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa)
+
+## Acknowledgements
+Thanks to the original [m2](https://github.com/aimagelab/meshed-memory-transformer) and the amazing work of [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa).
(The remaining changed files are not rendered: one diff is too large and one file cannot be displayed.)
others/extract_region_feature.py
@@ -0,0 +1,269 @@
import argparse
import os

import numpy as np
import torch
import tqdm
from fvcore.common.file_io import PathManager

from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.engine import default_setup
from detectron2.evaluation import inference_context
from detectron2.modeling import build_model

from grid_feats import (
    add_attribute_config,
    build_detection_test_loader_with_attributes,
)

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ["HDF5_USE_FILE_LOCKING"] = 'FALSE'
# A simple mapper from object detection dataset names to VQA dataset folders
dataset_to_folder_mapper = {}
dataset_to_folder_mapper['coco_2014_train'] = 'train2014'
dataset_to_folder_mapper['coco_2014_val'] = 'val2014'
# One may need to change the detectron2 code to support coco_2015_test:
# insert "coco_2015_test": ("coco/test2015", "coco/annotations/image_info_test2015.json"),
# at https://github.com/facebookresearch/detectron2/blob/master/detectron2/data/datasets/builtin.py#L36
dataset_to_folder_mapper['coco_2015_test'] = 'test2015'

def extract_grid_feature_argument_parser():
    parser = argparse.ArgumentParser(description="Grid feature extraction")
    parser.add_argument("--config-file", default="", metavar="FILE", help="path to config file")
    parser.add_argument("--dataset", help="name of the dataset", default="coco_2014_train",
                        choices=['coco_2014_train', 'coco_2014_val', 'coco_2015_test'])
    parser.add_argument(
        "opts",
        help="Modify config options using the command-line",
        default=None,
        nargs=argparse.REMAINDER,
    )
    return parser

def extract_grid_feature_on_dataset(model, data_loader, dump_folder):
    for idx, inputs in enumerate(tqdm.tqdm(data_loader)):
        with torch.no_grad():
            image_id = inputs[0]['image_id']
            file_name = '%d.pth' % image_id
            # compute conv5 grid features for the image
            images = model.preprocess_image(inputs)
            features = model.backbone(images.tensor)
            outputs = model.roi_heads.get_conv5_features(features)
            with PathManager.open(os.path.join(dump_folder, file_name), "wb") as f:
                # save as CPU tensors
                torch.save(outputs.cpu(), f)

def do_feature_extraction(cfg, model, dataset_name):
    with inference_context(model):
        dump_folder = os.path.join(cfg.OUTPUT_DIR, "features", dataset_to_folder_mapper[dataset_name])
        PathManager.mkdirs(dump_folder)
        data_loader = build_detection_test_loader_with_attributes(cfg, dataset_name)
        extract_grid_feature_on_dataset(model, data_loader, dump_folder)
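
# Illustrative note (not part of this commit): each image's grid features are
# dumped as a standalone .pth tensor under
#   <OUTPUT_DIR>/features/<train2014|val2014>/<image_id>.pth
# and can be read back with, e.g.
#   feats = torch.load('output/features/train2014/37209.pth', map_location='cpu')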

def setup(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    add_attribute_config(cfg)
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    # force the final residual block to have dilations 1
    # cfg.MODEL.RESNETS.RES5_DILATION = 1
    cfg.MODEL.WEIGHTS = 'output_X101/X-101.pth'
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.05  # small score threshold for the first pass
    cfg.freeze()
    default_setup(cfg, args)
    return cfg

def resetup(args):
    """
    Same as setup(), but with score filtering disabled; thresholding is done
    manually in the loops below.
    """
    cfg = get_cfg()
    add_attribute_config(cfg)
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    # force the final residual block to have dilations 1
    # cfg.MODEL.RESNETS.RES5_DILATION = 1
    cfg.MODEL.WEIGHTS = 'output_X101/X-101.pth'
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0  # keep all proposals
    cfg.freeze()
    default_setup(cfg, args)
    return cfg

def main(args):
    cfg = setup(args)
    model = build_model(cfg)
    DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
        cfg.MODEL.WEIGHTS, resume=True
    )
    do_feature_extraction(cfg, model, args.dataset)
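
# NOTE: main() above is kept for reference but is never called; everything
# below runs at module level with a hard-coded config and dataset.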

args = extract_grid_feature_argument_parser().parse_args(
    '--config-file configs/X-101-grid.yaml --dataset coco_2014_train'.split())

cfg = setup(args)
print(cfg)
model = build_model(cfg)
DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
    cfg.MODEL.WEIGHTS, resume=True
)

import h5py

save_dir = '/home/luoyp/disk1/grid-feats-vqa/feats'
region_before = h5py.File(os.path.join(save_dir, 'region_before_X152.hdf5'), 'w')
# region_after = h5py.File(os.path.join(save_dir, 'region_after.hdf5'), 'w')
# grid7 = h5py.File(os.path.join(save_dir, 'my_grid7.hdf5'), 'w')
# original_grid = h5py.File(os.path.join(save_dir, 'original_grid7.hdf5'), 'w')

thresh = 0.2        # minimum detection score for keeping a region
max_regions = 100   # hard cap on regions per image
pooling = torch.nn.AdaptiveAvgPool2d((7, 7))
image_id_collector = []
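
# Region-selection policy: keep proposals with score > thresh; if fewer than
# 10 survive, fall back to the top-10 by score; cap at max_regions. Images that
# still end up with fewer than 10 detections are collected in
# image_id_collector and re-extracted in a second, threshold-free pass below.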
for dataset_name in ['coco_2014_train', 'coco_2014_val']:
    with inference_context(model):
        dump_folder = os.path.join(cfg.OUTPUT_DIR, "features", dataset_to_folder_mapper[dataset_name])
        PathManager.mkdirs(dump_folder)
        data_loader = build_detection_test_loader_with_attributes(cfg, dataset_name)
        for idx, inputs in enumerate(tqdm.tqdm(data_loader)):
            with torch.no_grad():
                image_id = inputs[0]['image_id']
                file_name = '%d.pth' % image_id
                images = model.preprocess_image(inputs)
                features = model.backbone(images.tensor)

                proposals, _ = model.proposal_generator(images, features)
                proposal_boxes = [x.proposal_boxes for x in proposals]

                features = [features[f] for f in model.roi_heads.in_features]
                box_features1 = model.roi_heads.box_pooler(features, proposal_boxes)
                box_features = model.roi_heads.box_head(box_features1)

                predictions = model.roi_heads.box_predictor(box_features)
                pred_instances, index = model.roi_heads.box_predictor.inference(predictions, proposals)

                topk = 10
                scores = pred_instances[0].get_fields()['scores']
                topk_index = index[0][:topk]

                thresh_mask = scores > thresh
                thresh_index = index[0][thresh_mask]

                if len(thresh_index) < topk:
                    index = [topk_index]
                elif len(thresh_index) > max_regions:
                    index = [thresh_index[:max_regions]]
                else:
                    index = [thresh_index]

                if len(topk_index) < topk:
                    print("{} has less than 10 regions!!!".format(image_id))
                    image_id_collector.append(image_id)
                    continue

                # features of the selected proposals (before / after the box head)
                proposal_box_features1 = box_features1[index].mean(dim=[2, 3])
                proposal_box_features = box_features[index]
                boxes = pred_instances[0].get_fields()['pred_boxes'].tensor[:len(index[0])]

                image_size = pred_instances[0].image_size

                assert boxes.shape[0] == proposal_box_features.shape[0]

                region_before.create_dataset('{}_features'.format(image_id), data=proposal_box_features1.cpu().numpy())
                region_before.create_dataset('{}_boxes'.format(image_id), data=boxes.cpu().numpy())
                region_before.create_dataset('{}_size'.format(image_id), data=np.array([image_size]))
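
                # The keys written here ('<image_id>_features', '_boxes', '_size') are the
                # first three of the five keys the README describes; '_grids' and '_mask'
                # are produced by separate steps (grid-feats-vqa and align.ipynb).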

                # region_after.create_dataset('{}_features'.format(image_id), data=proposal_box_features.cpu().numpy())
                # region_after.create_dataset('{}_boxes'.format(image_id), data=boxes.cpu().numpy())
                # region_after.create_dataset('{}_size'.format(image_id), data=np.array([image_size]))

del cfg
del model

cfg = resetup(args)
print(cfg)
model = build_model(cfg)
DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
    cfg.MODEL.WEIGHTS, resume=True
)

print('problem images:')
print(image_id_collector)

for dataset_name in ['coco_2014_train', 'coco_2014_val']:
    with inference_context(model):
        dump_folder = os.path.join(cfg.OUTPUT_DIR, "features", dataset_to_folder_mapper[dataset_name])
        PathManager.mkdirs(dump_folder)
        data_loader = build_detection_test_loader_with_attributes(cfg, dataset_name)
        for idx, inputs in enumerate(tqdm.tqdm(data_loader)):
            with torch.no_grad():
                image_id = inputs[0]['image_id']
                if image_id not in image_id_collector:
                    continue
                print('append image:', image_id)
                file_name = '%d.pth' % image_id
                images = model.preprocess_image(inputs)
                features = model.backbone(images.tensor)

                proposals, _ = model.proposal_generator(images, features)
                proposal_boxes = [x.proposal_boxes for x in proposals]

                features = [features[f] for f in model.roi_heads.in_features]
                box_features1 = model.roi_heads.box_pooler(features, proposal_boxes)
                box_features = model.roi_heads.box_head(box_features1)

                predictions = model.roi_heads.box_predictor(box_features)
                pred_instances, index = model.roi_heads.box_predictor.inference(predictions, proposals)

                topk = 10
                scores = pred_instances[0].get_fields()['scores']
                topk_index = index[0][:topk]

                thresh_mask = scores > thresh
                thresh_index = index[0][thresh_mask]

                # no score threshold in this pass: always take the top-k proposals
                index = [topk_index]
                if len(topk_index) > max_regions:
                    index = [topk_index[:max_regions]]

                if len(topk_index) < topk:
                    print("{} has less than 10 regions!!!".format(image_id))
                    raise RuntimeError("{} still has fewer than 10 regions".format(image_id))

                # features of the selected proposals (before / after the box head)
                proposal_box_features1 = box_features1[index].mean(dim=[2, 3])
                proposal_box_features = box_features[index]

                boxes = pred_instances[0].get_fields()['pred_boxes'].tensor[:len(index[0])]
                image_size = pred_instances[0].image_size

                assert boxes.shape[0] == proposal_box_features.shape[0]

                region_before.create_dataset('{}_features'.format(image_id), data=proposal_box_features1.cpu().numpy())
                region_before.create_dataset('{}_boxes'.format(image_id), data=boxes.cpu().numpy())
                region_before.create_dataset('{}_size'.format(image_id), data=np.array([image_size]))

                # region_after.create_dataset('{}_features'.format(image_id), data=proposal_box_features.cpu().numpy())
                # region_after.create_dataset('{}_boxes'.format(image_id), data=boxes.cpu().numpy())
                # region_after.create_dataset('{}_size'.format(image_id), data=np.array([image_size]))

region_before.close()
# region_after.close()
# grid7.close()
# original_grid.close()
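
This script writes only the first three of the README's five keys. As a rough sketch of the remaining merge step (all paths, the per-image .pth layout, and the grid reshape are assumptions, not part of this commit; the `%d_mask` key would come from align.ipynb):

```python
import h5py
import torch

region_path = 'region_before_X152.hdf5'  # written by the script above
out_path = 'coco_all_align.hdf5'         # five-key layout described in the README

with h5py.File(region_path, 'r') as regions, h5py.File(out_path, 'w') as out:
    image_ids = sorted({int(k.split('_')[0]) for k in regions.keys()})
    for image_id in image_ids:
        # copy the three region-level keys unchanged
        for suffix in ('features', 'boxes', 'size'):
            key = '%d_%s' % (image_id, suffix)
            out.create_dataset(key, data=regions[key][()])
        # hypothetical: grid features dumped per image by grid-feats-vqa;
        # reshape (C, H, W) -> (H*W, C) so each row is one grid position
        grid = torch.load('output/features/train2014/%d.pth' % image_id,
                          map_location='cpu').squeeze(0)
        out.create_dataset('%d_grids' % image_id, data=grid.flatten(1).t().numpy())
```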