
Commit

extra code and image
luo3300612 committed Feb 21, 2021
1 parent 1026ed2 commit 647342c
Showing 4 changed files with 637 additions and 59 deletions.
90 changes: 31 additions & 59 deletions README.md
@@ -1,72 +1,44 @@
# M²: Meshed-Memory Transformer
This repository contains the reference code for the paper _[M²: Meshed-Memory Transformer for Image Captioning](https://arxiv.org/abs/1912.08226)_.
# Dual-Level Collaborative Transformer for Image Captioning
This repository contains the reference code for the paper [Dual-Level Collaborative Transformer for Image Captioning](https://arxiv.org/pdf/2101.06462.pdf).

<p align="center">
<img src="images/m2.png" alt="Meshed-Memory Transformer" width="320"/>
</p>

## Environment setup
Clone the repository and create the `m2release` conda environment using the `environment.yml` file:
```
conda env create -f environment.yml
conda activate m2release
```

Then download spacy data by executing the following command:
```
python -m spacy download en
```

Note: Python 3.6 is required to run our code.
![](https://github.com/luo3300612/image-captioning-DLCT/raw/master/images/arch.png)

## Experiment setup
Please refer to [m2 transformer](https://github.com/aimagelab/meshed-memory-transformer).

## Data preparation
To run the code, annotations and detection features for the COCO dataset are needed. Please download the annotations file [annotations.zip](https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing) and extract it.
* **Annotation**. Download the annotation file [annotation.zip](https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing)
* **Feature**. You can download our ResNeXt-101 feature (.hdf5 file) [here](https://pan.baidu.com/s/188xmv2r5eXUbEUqKSA4BCw). Access code: etrx.

Detection features are computed with the code provided by [1]. To reproduce our result, please download the COCO features file [coco_detections.hdf5](https://drive.google.com/open?id=1MV6dSnqViQfyvgyHrmAT_lLpFbkzp3mx) (~53.5 GB), in which detections of each image are stored under the `<image_id>_features` key. `<image_id>` is the id of each COCO image, without leading zeros (e.g. the `<image_id>` for `COCO_val2014_000000037209.jpg` is `37209`), and each value should be a `(N, 2048)` tensor, where `N` is the number of detections.
There are five kinds of keys in our .hdf5 file:

* `['%d_features' % image_id]`: region features (N_regions, feature_dim)
* `['%d_boxes' % image_id]`: bounding box of region features (N_regions, 4)
* `['%d_size' % image_id]`: size of original image (for normalizing bounding box), (2,)
* `['%d_grids' % image_id]`: grid features (N_grids, feature_dim)
* `['%d_mask' % image_id]`: geometric alignment graph, (N_regions, N_grids)
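
For a quick sanity check of a downloaded or self-extracted feature file, the keys can be read back with `h5py`. This is only a minimal sketch: the file path (borrowed from the evaluation command further below) and the image id are examples, not fixed by this repo.

```python
import h5py

# Example path and image id; substitute your own feature file and any COCO image id
# (ids are used without leading zeros, e.g. 37209 for COCO_val2014_000000037209.jpg).
path = './data/coco_all_align.hdf5'
image_id = 37209

with h5py.File(path, 'r') as f:
    regions = f['%d_features' % image_id][()]  # (N_regions, feature_dim)
    boxes = f['%d_boxes' % image_id][()]       # (N_regions, 4)
    size = f['%d_size' % image_id][()]         # original image size, for normalizing boxes
    grids = f['%d_grids' % image_id][()]       # (N_grids, feature_dim)
    mask = f['%d_mask' % image_id][()]         # (N_regions, N_grids) geometric alignment graph
    assert regions.shape[0] == boxes.shape[0] == mask.shape[0]
    print(regions.shape, grids.shape, mask.shape)
```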

## Evaluation
To reproduce the results reported in our paper, download the pretrained model file [meshed_memory_transformer.pth](https://drive.google.com/file/d/1naUSnVqXSMIdoiNz_fjqKwh9tXF7x8Nx/view?usp=sharing) and place it in the code folder.
We extract features with the code in [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa).

Run `python test.py` using the following arguments:
The first three keys can be obtained when extracting region features with [extract_region_feature.py](./others/extract_region_feature.py).
The fourth key can be obtained when extracting grid features with the code in [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa).
The last key can be obtained with [align.ipynb](./align/align.ipynb).

| Argument | Possible values |
|------|------|
| `--batch_size` | Batch size (default: 10) |
| `--workers` | Number of workers (default: 0) |
| `--features_path` | Path to detection features file |
| `--annotation_folder` | Path to folder with COCO annotations |
## Training

#### Expected output
Under `output_logs/`, you may also find the expected output of the evaluation code.


## Training procedure
Run `python train.py` using the following arguments:

| Argument | Possible values |
|------|------|
| `--exp_name` | Experiment name|
| `--batch_size` | Batch size (default: 10) |
| `--workers` | Number of workers (default: 0) |
| `--m` | Number of memory vectors (default: 40) |
| `--head` | Number of heads (default: 8) |
| `--warmup` | Warmup value for learning rate scheduling (default: 10000) |
| `--resume_last` | If used, the training will be resumed from the last checkpoint. |
| `--resume_best` | If used, the training will be resumed from the best checkpoint. |
| `--features_path` | Path to detection features file |
| `--annotation_folder` | Path to folder with COCO annotations |
| `--logs_folder` | Path folder for tensorboard logs (default: "tensorboard_logs")|

For example, to train our model with the parameters used in our experiments, use
```
python train.py --exp_name m2_transformer --batch_size 50 --m 40 --head 8 --warmup 10000 --features_path /path/to/features --annotation_folder /path/to/annotations
```
## Evaluation
```python
python eval.py --annotation annotation --workers 4 --features_path ./data/coco_all_align.hdf5 --model_path path_of_model_to_eval.pth --model DLCT --image_field ImageAllFieldWithMask --grid_embed --box_embed --dump_json gen_res.json --beam_size 5
```
Important args:
* `--features_path`: path to the .hdf5 feature file
* `--model_path`: path to the model checkpoint to evaluate
* `--dump_json`: file to which the generated captions are dumped
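
To take a quick look at what was dumped, the JSON file can be loaded directly. This is a sketch only; it does not assume a particular structure for the entries, so print a few of them first.

```python
import json

# 'gen_res.json' is the file name passed to --dump_json in the command above.
with open('gen_res.json') as f:
    results = json.load(f)

# Print a few entries to see the actual structure before any post-processing.
print(type(results))
sample = results[:3] if isinstance(results, list) else list(results.items())[:3]
print(sample)
```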


<p align="center">
<img src="images/results.png" alt="Sample Results" width="850"/>
</p>
## References
[1] [M2](https://github.com/aimagelab/meshed-memory-transformer)

#### References
[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In _Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition_, 2018.
[2] [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa)
## Acknowledgements
Thanks to the original [m2](https://github.com/aimagelab/meshed-memory-transformer) and the amazing work of [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa).
337 changes: 337 additions & 0 deletions align/align.ipynb


Binary file added images/arch.png
269 changes: 269 additions & 0 deletions others/extract_region_feature.py
@@ -0,0 +1,269 @@
import argparse
import os
import torch
import tqdm
from fvcore.common.file_io import PathManager
# import pdb

from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.engine import default_setup
from detectron2.evaluation import inference_context
from detectron2.modeling import build_model
import numpy as np

from grid_feats import (
    add_attribute_config,
    build_detection_test_loader_with_attributes,
)

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ["HDF5_USE_FILE_LOCKING"] = 'FALSE'

# A simple mapper from object detection dataset to VQA dataset names
dataset_to_folder_mapper = {}
dataset_to_folder_mapper['coco_2014_train'] = 'train2014'
dataset_to_folder_mapper['coco_2014_val'] = 'val2014'
# One may need to change the Detectron2 code to support coco_2015_test
# insert "coco_2015_test": ("coco/test2015", "coco/annotations/image_info_test2015.json"),
# at: https://github.com/facebookresearch/detectron2/blob/master/detectron2/data/datasets/builtin.py#L36
dataset_to_folder_mapper['coco_2014_test'] = 'test2014'

def extract_grid_feature_argument_parser():
    parser = argparse.ArgumentParser(description="Grid feature extraction")
    parser.add_argument("--config-file", default="", metavar="FILE", help="path to config file")
    parser.add_argument("--dataset", help="name of the dataset", default="coco_2014_train",
                        choices=['coco_2014_train', 'coco_2014_val', 'coco_2015_test'])
    parser.add_argument(
        "opts",
        help="Modify config options using the command-line",
        default=None,
        nargs=argparse.REMAINDER,
    )
    return parser

def extract_grid_feature_on_dataset(model, data_loader, dump_folder):
    for idx, inputs in enumerate(tqdm.tqdm(data_loader)):
        with torch.no_grad():
            image_id = inputs[0]['image_id']
            file_name = '%d.pth' % image_id
            # compute features
            images = model.preprocess_image(inputs)
            features = model.backbone(images.tensor)
            outputs = model.roi_heads.get_conv5_features(features)
            with PathManager.open(os.path.join(dump_folder, file_name), "wb") as f:
                # save as CPU tensors
                torch.save(outputs.cpu(), f)

def do_feature_extraction(cfg, model, dataset_name):
    with inference_context(model):
        dump_folder = os.path.join(cfg.OUTPUT_DIR, "features", dataset_to_folder_mapper[dataset_name])
        PathManager.mkdirs(dump_folder)
        data_loader = build_detection_test_loader_with_attributes(cfg, dataset_name)
        extract_grid_feature_on_dataset(model, data_loader, dump_folder)

def setup(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    add_attribute_config(cfg)
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    # force the final residual block to have dilations 1
    # cfg.MODEL.RESNETS.RES5_DILATION = 1
    cfg.MODEL.WEIGHTS = 'output_X101/X-101.pth'
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.05  # I do thresh filter in my code
    cfg.freeze()
    default_setup(cfg, args)
    return cfg

def resetup(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    add_attribute_config(cfg)
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    # force the final residual block to have dilations 1
    # cfg.MODEL.RESNETS.RES5_DILATION = 1
    cfg.MODEL.WEIGHTS = 'output_X101/X-101.pth'
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0  # I do thresh filter in my code
    cfg.freeze()
    default_setup(cfg, args)
    return cfg
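
# Note: resetup() differs from setup() only in SCORE_THRESH_TEST (0 instead of 0.05),
# so that low-scoring proposals survive for the second-pass extraction below.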

def main(args):
    cfg = setup(args)
    model = build_model(cfg)
    DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
        cfg.MODEL.WEIGHTS, resume=True
    )
    do_feature_extraction(cfg, model, args.dataset)




args = extract_grid_feature_argument_parser().parse_args('--config-file configs/X-101-grid.yaml --dataset coco_2014_train'.split())
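# The command line above is hard-coded for convenience; edit the --config-file path and
# --dataset to match your own setup before running the script.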


cfg = setup(args)
print(cfg)
model = build_model(cfg)
DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
    cfg.MODEL.WEIGHTS, resume=True
)

import h5py
import os
from detectron2.structures import Boxes


save_dir = '/home/luoyp/disk1/grid-feats-vqa/feats'
region_before = h5py.File(os.path.join(save_dir,'region_before_X152.hdf5'),'w')
# region_after = h5py.File(os.path.join(save_dir,'region_after.hdf5'),'w')
# grid7 = h5py.File(os.path.join(save_dir,'my_grid7.hdf5'),'w')
# original_grid = h5py.File(os.path.join(save_dir,'original_grid7.hdf5'),'w')

thresh = 0.2
max_regions = 100
pooling = torch.nn.AdaptiveAvgPool2d((7,7))
image_id_collector = []
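# First pass over COCO train/val: for each image, keep proposals with score > thresh
# (falling back to the top-10 if fewer pass, truncating to max_regions if more do) and
# write the pooled region features, boxes and image size to the hdf5 file. Images that
# yield fewer than 10 detections are collected for a second pass below.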
for dataset_name in ['coco_2014_train','coco_2014_val']:
    with inference_context(model):
        dump_folder = os.path.join(cfg.OUTPUT_DIR, "features", dataset_to_folder_mapper[dataset_name])
        PathManager.mkdirs(dump_folder)
        data_loader = build_detection_test_loader_with_attributes(cfg, dataset_name)
        for idx, inputs in enumerate(tqdm.tqdm(data_loader)):
            with torch.no_grad():
                image_id = inputs[0]['image_id']
                file_name = '%d.pth' % image_id
                images = model.preprocess_image(inputs)
                features = model.backbone(images.tensor)

                proposals, _ = model.proposal_generator(images, features)
                proposal_boxes = [x.proposal_boxes for x in proposals]

                features = [features[f] for f in model.roi_heads.in_features]
                box_features1 = model.roi_heads.box_pooler(features, [x.proposal_boxes for x in proposals])
                box_features = model.roi_heads.box_head(box_features1)

                predictions = model.roi_heads.box_predictor(box_features)
                pred_instances, index = model.roi_heads.box_predictor.inference(predictions, proposals)

                topk = 10
                scores = pred_instances[0].get_fields()['scores']
                topk_index = index[0][:topk]

                thresh_mask = scores > thresh
                thresh_index = index[0][thresh_mask]

                if len(thresh_index) < 10:
                    index = [topk_index]
                elif len(thresh_index) > max_regions:
                    index = [thresh_index[:max_regions]]
                else:
                    index = [thresh_index]

                if len(topk_index) < 10:
                    print("{} has less than 10 regions!!!".format(image_id))
                    image_id_collector.append(image_id)
                    continue

                # feature of proposal
                proposal_box_features1 = box_features1[index].mean(dim=[2,3])
                proposal_box_features = box_features[index]
                # pdb.set_trace()
                boxes = pred_instances[0].get_fields()['pred_boxes'].tensor[:len(index[0])]

                image_size = pred_instances[0].image_size

                assert boxes.shape[0] == proposal_box_features.shape[0]

                region_before.create_dataset('{}_features'.format(image_id),data=proposal_box_features1.cpu().numpy())
                region_before.create_dataset('{}_boxes'.format(image_id),data=boxes.cpu().numpy())
                region_before.create_dataset('{}_size'.format(image_id),data=np.array([image_size]))

                # region_after.create_dataset('{}_features'.format(image_id),data=proposal_box_features.cpu().numpy())
                # region_after.create_dataset('{}_boxes'.format(image_id),data=boxes.cpu().numpy())
                # region_after.create_dataset('{}_size'.format(image_id),data=np.array([image_size]))



del cfg
del model
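
# Second pass: rebuild the model with resetup() (SCORE_THRESH_TEST = 0) and re-extract
# features for the problem images collected above, keeping their top-10 proposals
# regardless of score.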

cfg = resetup(args)
print(cfg)
model = build_model(cfg)
DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
    cfg.MODEL.WEIGHTS, resume=True
)

print('problem images:')
print(image_id_collector)

for dataset_name in ['coco_2014_train','coco_2014_val']:
    with inference_context(model):
        dump_folder = os.path.join(cfg.OUTPUT_DIR, "features", dataset_to_folder_mapper[dataset_name])
        PathManager.mkdirs(dump_folder)
        data_loader = build_detection_test_loader_with_attributes(cfg, dataset_name)
        for idx, inputs in enumerate(tqdm.tqdm(data_loader)):
            with torch.no_grad():
                image_id = inputs[0]['image_id']
                if image_id not in image_id_collector:
                    continue
                print('append image:', image_id)
                file_name = '%d.pth' % image_id
                images = model.preprocess_image(inputs)
                features = model.backbone(images.tensor)

                proposals, _ = model.proposal_generator(images, features)
                proposal_boxes = [x.proposal_boxes for x in proposals]

                features = [features[f] for f in model.roi_heads.in_features]
                box_features1 = model.roi_heads.box_pooler(features, [x.proposal_boxes for x in proposals])
                box_features = model.roi_heads.box_head(box_features1)

                predictions = model.roi_heads.box_predictor(box_features)
                pred_instances, index = model.roi_heads.box_predictor.inference(predictions, proposals)

                topk = 10
                scores = pred_instances[0].get_fields()['scores']
                topk_index = index[0][:topk]

                thresh_mask = scores > thresh
                thresh_index = index[0][thresh_mask]

                # if len(thresh_index) < 10:
                index = [topk_index]
                if len(topk_index) > max_regions:
                    index = [topk_index[:max_regions]]

                if len(topk_index) < 10:
                    print("{} has less than 10 regions!!!".format(image_id))
                    raise RuntimeError("{} still has less than 10 regions".format(image_id))

                # feature of proposal
                proposal_box_features1 = box_features1[index].mean(dim=[2,3])
                proposal_box_features = box_features[index]

                boxes = pred_instances[0].get_fields()['pred_boxes'].tensor[:len(index[0])]
                image_size = pred_instances[0].image_size

                assert boxes.shape[0] == proposal_box_features.shape[0]

                region_before.create_dataset('{}_features'.format(image_id),data=proposal_box_features1.cpu().numpy())
                region_before.create_dataset('{}_boxes'.format(image_id),data=boxes.cpu().numpy())
                region_before.create_dataset('{}_size'.format(image_id),data=np.array([image_size]))

                # region_after.create_dataset('{}_features'.format(image_id),data=proposal_box_features.cpu().numpy())
                # region_after.create_dataset('{}_boxes'.format(image_id),data=boxes.cpu().numpy())
                # region_after.create_dataset('{}_size'.format(image_id),data=np.array([image_size]))

region_before.close()
# region_after.close()
# grid7.close()
# original_grid.close()
