Commit 647342c (parent 1026ed2), showing 4 changed files with 637 additions and 59 deletions.
README.md
@@ -1,72 +1,44 @@
-# M²: Meshed-Memory Transformer
-This repository contains the reference code for the paper _[M²: Meshed-Memory Transformer for Image Captioning](https://arxiv.org/abs/1912.08226)_.
+# Dual-Level Collaborative Transformer for Image Captioning
+This repository contains the reference code for the paper [Dual-Level Collaborative Transformer for Image Captioning](https://arxiv.org/pdf/2101.06462.pdf).

-<p align="center">
-  <img src="images/m2.png" alt="Meshed-Memory Transformer" width="320"/>
-</p>

-## Environment setup
-Clone the repository and create the `m2release` conda environment using the `environment.yml` file:
-```
-conda env create -f environment.yml
-conda activate m2release
-```

-Then download spacy data by executing the following command:
-```
-python -m spacy download en
-```

-Note: Python 3.6 is required to run our code.
+![](https://github.com/luo3300612/image-captioning-DLCT/raw/master/images/arch.png)

+## Experiment setup
+Please refer to [m2 transformer](https://github.com/aimagelab/meshed-memory-transformer).

 ## Data preparation
-To run the code, annotations and detection features for the COCO dataset are needed. Please download the annotations file [annotations.zip](https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing) and extract it.
+* **Annotation**. Download the annotation file [annotation.zip](https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing).
+* **Feature**. You can download our ResNeXt-101 features (.hdf5 file) [here](https://pan.baidu.com/s/188xmv2r5eXUbEUqKSA4BCw). Access code: etrx.

-Detection features are computed with the code provided by [1]. To reproduce our result, please download the COCO features file [coco_detections.hdf5](https://drive.google.com/open?id=1MV6dSnqViQfyvgyHrmAT_lLpFbkzp3mx) (~53.5 GB), in which detections of each image are stored under the `<image_id>_features` key. `<image_id>` is the id of each COCO image, without leading zeros (e.g. the `<image_id>` for `COCO_val2014_000000037209.jpg` is `37209`), and each value should be a `(N, 2048)` tensor, where `N` is the number of detections.
+There are five kinds of keys in our .hdf5 file:

+* `['%d_features' % image_id]`: region features, `(N_regions, feature_dim)`
+* `['%d_boxes' % image_id]`: bounding boxes of the region features, `(N_regions, 4)`
+* `['%d_size' % image_id]`: size of the original image (for normalizing bounding boxes), `(2,)`
+* `['%d_grids' % image_id]`: grid features, `(N_grids, feature_dim)`
+* `['%d_mask' % image_id]`: geometric alignment graph, `(N_regions, N_grids)`

-## Evaluation
-To reproduce the results reported in our paper, download the pretrained model file [meshed_memory_transformer.pth](https://drive.google.com/file/d/1naUSnVqXSMIdoiNz_fjqKwh9tXF7x8Nx/view?usp=sharing) and place it in the code folder.
+We extract features with the code in [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa).

-Run `python test.py` using the following arguments:
+The first three keys can be obtained when extracting region features with [extract_region_feature.py](./others/extract_region_feature.py).
+The fourth key can be obtained when extracting grid features with the code in [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa).
+The last key can be obtained with [align.ipynb](./align/align.ipynb).
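
For a quick sanity check of a prepared file, here is a minimal sketch (using the `./data/coco_all_align.hdf5` path from the evaluation command below; the image id is only an example) that prints the shape of each of the five arrays stored for one image:

```python
import h5py

with h5py.File('./data/coco_all_align.hdf5', 'r') as f:
    image_id = 37209  # any COCO image id present in the file
    for suffix in ['features', 'boxes', 'size', 'grids', 'mask']:
        key = '%d_%s' % (image_id, suffix)
        print(key, f[key].shape)
```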

-| Argument | Possible values |
-|------|------|
-| `--batch_size` | Batch size (default: 10) |
-| `--workers` | Number of workers (default: 0) |
-| `--features_path` | Path to detection features file |
-| `--annotation_folder` | Path to folder with COCO annotations |
+## Training

-#### Expected output
-Under `output_logs/`, you may also find the expected output of the evaluation code.

-## Training procedure
-Run `python train.py` using the following arguments:

-| Argument | Possible values |
-|------|------|
-| `--exp_name` | Experiment name |
-| `--batch_size` | Batch size (default: 10) |
-| `--workers` | Number of workers (default: 0) |
-| `--m` | Number of memory vectors (default: 40) |
-| `--head` | Number of heads (default: 8) |
-| `--warmup` | Warmup value for learning rate scheduling (default: 10000) |
-| `--resume_last` | If used, training will be resumed from the last checkpoint. |
-| `--resume_best` | If used, training will be resumed from the best checkpoint. |
-| `--features_path` | Path to detection features file |
-| `--annotation_folder` | Path to folder with COCO annotations |
-| `--logs_folder` | Path to folder for tensorboard logs (default: "tensorboard_logs") |

-For example, to train our model with the parameters used in our experiments, use
-```
-python train.py --exp_name m2_transformer --batch_size 50 --m 40 --head 8 --warmup 10000 --features_path /path/to/features --annotation_folder /path/to/annotations
-```
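
The new Training section above is empty in this commit. For orientation only, here is a hypothetical DLCT invocation assembled from the old defaults above and the evaluation flags below (argument names are assumptions, not confirmed by this commit):

```
python train.py --exp_name DLCT --batch_size 50 --workers 4 --features_path ./data/coco_all_align.hdf5 --annotation annotation --model DLCT --image_field ImageAllFieldWithMask --grid_embed --box_embed
```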
+## Evaluation
+```python
+python eval.py --annotation annotation --workers 4 --features_path ./data/coco_all_align.hdf5 --model_path path_of_model_to_eval.pth --model DLCT --image_field ImageAllFieldWithMask --grid_embed --box_embed --dump_json gen_res.json --beam_size 5
+```
+Important args:
+* `--features_path`: path to the .hdf5 file
+* `--model_path`: path to the model checkpoint to evaluate
+* `--dump_json`: file to dump the generated captions to
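
To inspect the captions dumped via `--dump_json`, a minimal sketch (the JSON schema is not shown in this commit, so this only peeks at the structure):

```python
import json

with open('gen_res.json') as f:
    results = json.load(f)

print(type(results).__name__, len(results))
# Print a few entries to learn the actual record format before processing it.
sample = results[:3] if isinstance(results, list) else list(results.items())[:3]
for entry in sample:
    print(entry)
```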

-<p align="center">
-  <img src="images/results.png" alt="Sample Results" width="850"/>
-</p>
+## References
+[1] [M2](https://github.com/aimagelab/meshed-memory-transformer)

-#### References
-[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In _Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition_, 2018.
+[2] [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa)
+
+## Acknowledgements
+Thanks to the original [m2](https://github.com/aimagelab/meshed-memory-transformer) and the amazing work of [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa).
(The remaining changed files are not rendered: one diff is too large and one file cannot be displayed.)
others/extract_region_feature.py
@@ -0,0 +1,269 @@
import argparse
import os

import numpy as np
import torch
import tqdm
from fvcore.common.file_io import PathManager

from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.engine import default_setup
from detectron2.evaluation import inference_context
from detectron2.modeling import build_model

from grid_feats import (
    add_attribute_config,
    build_detection_test_loader_with_attributes,
)

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ["HDF5_USE_FILE_LOCKING"] = 'FALSE'
# A simple mapper from object detection dataset names to VQA dataset folders
dataset_to_folder_mapper = {}
dataset_to_folder_mapper['coco_2014_train'] = 'train2014'
dataset_to_folder_mapper['coco_2014_val'] = 'val2014'
# One may need to change the detectron2 code to support coco_2015_test:
# insert "coco_2015_test": ("coco/test2015", "coco/annotations/image_info_test2015.json"),
# at https://github.com/facebookresearch/detectron2/blob/master/detectron2/data/datasets/builtin.py#L36
dataset_to_folder_mapper['coco_2015_test'] = 'test2015'

def extract_grid_feature_argument_parser():
    parser = argparse.ArgumentParser(description="Grid feature extraction")
    parser.add_argument("--config-file", default="", metavar="FILE", help="path to config file")
    parser.add_argument("--dataset", help="name of the dataset", default="coco_2014_train",
                        choices=['coco_2014_train', 'coco_2014_val', 'coco_2015_test'])
    parser.add_argument(
        "opts",
        help="Modify config options using the command-line",
        default=None,
        nargs=argparse.REMAINDER,
    )
    return parser

def extract_grid_feature_on_dataset(model, data_loader, dump_folder):
    for idx, inputs in enumerate(tqdm.tqdm(data_loader)):
        with torch.no_grad():
            image_id = inputs[0]['image_id']
            file_name = '%d.pth' % image_id
            # compute conv5 grid features for the image
            images = model.preprocess_image(inputs)
            features = model.backbone(images.tensor)
            outputs = model.roi_heads.get_conv5_features(features)
            with PathManager.open(os.path.join(dump_folder, file_name), "wb") as f:
                # save as CPU tensors
                torch.save(outputs.cpu(), f)

def do_feature_extraction(cfg, model, dataset_name):
    with inference_context(model):
        dump_folder = os.path.join(cfg.OUTPUT_DIR, "features", dataset_to_folder_mapper[dataset_name])
        PathManager.mkdirs(dump_folder)
        data_loader = build_detection_test_loader_with_attributes(cfg, dataset_name)
        extract_grid_feature_on_dataset(model, data_loader, dump_folder)
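
# Illustrative note (not part of this commit): each image's grid features are
# dumped as a standalone .pth tensor under
#   <OUTPUT_DIR>/features/<train2014|val2014>/<image_id>.pth
# and can be read back with, e.g.
#   feats = torch.load('output/features/train2014/37209.pth', map_location='cpu')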

def setup(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    add_attribute_config(cfg)
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    # force the final residual block to have dilations 1
    # cfg.MODEL.RESNETS.RES5_DILATION = 1
    cfg.MODEL.WEIGHTS = 'output_X101/X-101.pth'
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.05  # small score threshold for the first pass
    cfg.freeze()
    default_setup(cfg, args)
    return cfg

def resetup(args):
    """
    Same as setup(), but with score filtering disabled; thresholding is done
    manually in the loops below.
    """
    cfg = get_cfg()
    add_attribute_config(cfg)
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    # force the final residual block to have dilations 1
    # cfg.MODEL.RESNETS.RES5_DILATION = 1
    cfg.MODEL.WEIGHTS = 'output_X101/X-101.pth'
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0  # keep all proposals
    cfg.freeze()
    default_setup(cfg, args)
    return cfg

def main(args):
    cfg = setup(args)
    model = build_model(cfg)
    DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
        cfg.MODEL.WEIGHTS, resume=True
    )
    do_feature_extraction(cfg, model, args.dataset)
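
# NOTE: main() above is kept for reference but is never called; everything
# below runs at module level with a hard-coded config and dataset.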

args = extract_grid_feature_argument_parser().parse_args(
    '--config-file configs/X-101-grid.yaml --dataset coco_2014_train'.split())

cfg = setup(args)
print(cfg)
model = build_model(cfg)
DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
    cfg.MODEL.WEIGHTS, resume=True
)

import h5py

save_dir = '/home/luoyp/disk1/grid-feats-vqa/feats'
region_before = h5py.File(os.path.join(save_dir, 'region_before_X152.hdf5'), 'w')
# region_after = h5py.File(os.path.join(save_dir, 'region_after.hdf5'), 'w')
# grid7 = h5py.File(os.path.join(save_dir, 'my_grid7.hdf5'), 'w')
# original_grid = h5py.File(os.path.join(save_dir, 'original_grid7.hdf5'), 'w')

thresh = 0.2        # minimum detection score for keeping a region
max_regions = 100   # hard cap on regions per image
pooling = torch.nn.AdaptiveAvgPool2d((7, 7))
image_id_collector = []
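
# Region-selection policy: keep proposals with score > thresh; if fewer than
# 10 survive, fall back to the top-10 by score; cap at max_regions. Images that
# still end up with fewer than 10 detections are collected in
# image_id_collector and re-extracted in a second, threshold-free pass below.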
for dataset_name in ['coco_2014_train', 'coco_2014_val']:
    with inference_context(model):
        dump_folder = os.path.join(cfg.OUTPUT_DIR, "features", dataset_to_folder_mapper[dataset_name])
        PathManager.mkdirs(dump_folder)
        data_loader = build_detection_test_loader_with_attributes(cfg, dataset_name)
        for idx, inputs in enumerate(tqdm.tqdm(data_loader)):
            with torch.no_grad():
                image_id = inputs[0]['image_id']
                file_name = '%d.pth' % image_id
                images = model.preprocess_image(inputs)
                features = model.backbone(images.tensor)

                proposals, _ = model.proposal_generator(images, features)
                proposal_boxes = [x.proposal_boxes for x in proposals]

                features = [features[f] for f in model.roi_heads.in_features]
                box_features1 = model.roi_heads.box_pooler(features, proposal_boxes)
                box_features = model.roi_heads.box_head(box_features1)

                predictions = model.roi_heads.box_predictor(box_features)
                pred_instances, index = model.roi_heads.box_predictor.inference(predictions, proposals)

                topk = 10
                scores = pred_instances[0].get_fields()['scores']
                topk_index = index[0][:topk]

                thresh_mask = scores > thresh
                thresh_index = index[0][thresh_mask]

                if len(thresh_index) < topk:
                    index = [topk_index]
                elif len(thresh_index) > max_regions:
                    index = [thresh_index[:max_regions]]
                else:
                    index = [thresh_index]

                if len(topk_index) < topk:
                    print("{} has less than 10 regions!!!".format(image_id))
                    image_id_collector.append(image_id)
                    continue

                # features of the selected proposals (before / after the box head)
                proposal_box_features1 = box_features1[index].mean(dim=[2, 3])
                proposal_box_features = box_features[index]
                boxes = pred_instances[0].get_fields()['pred_boxes'].tensor[:len(index[0])]

                image_size = pred_instances[0].image_size

                assert boxes.shape[0] == proposal_box_features.shape[0]

                region_before.create_dataset('{}_features'.format(image_id), data=proposal_box_features1.cpu().numpy())
                region_before.create_dataset('{}_boxes'.format(image_id), data=boxes.cpu().numpy())
                region_before.create_dataset('{}_size'.format(image_id), data=np.array([image_size]))
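
                # The keys written here ('<image_id>_features', '_boxes', '_size') are the
                # first three of the five keys the README describes; '_grids' and '_mask'
                # are produced by separate steps (grid-feats-vqa and align.ipynb).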

                # region_after.create_dataset('{}_features'.format(image_id), data=proposal_box_features.cpu().numpy())
                # region_after.create_dataset('{}_boxes'.format(image_id), data=boxes.cpu().numpy())
                # region_after.create_dataset('{}_size'.format(image_id), data=np.array([image_size]))

del cfg
del model

cfg = resetup(args)
print(cfg)
model = build_model(cfg)
DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
    cfg.MODEL.WEIGHTS, resume=True
)

print('problem images:')
print(image_id_collector)

for dataset_name in ['coco_2014_train', 'coco_2014_val']:
    with inference_context(model):
        dump_folder = os.path.join(cfg.OUTPUT_DIR, "features", dataset_to_folder_mapper[dataset_name])
        PathManager.mkdirs(dump_folder)
        data_loader = build_detection_test_loader_with_attributes(cfg, dataset_name)
        for idx, inputs in enumerate(tqdm.tqdm(data_loader)):
            with torch.no_grad():
                image_id = inputs[0]['image_id']
                if image_id not in image_id_collector:
                    continue
                print('append image:', image_id)
                file_name = '%d.pth' % image_id
                images = model.preprocess_image(inputs)
                features = model.backbone(images.tensor)

                proposals, _ = model.proposal_generator(images, features)
                proposal_boxes = [x.proposal_boxes for x in proposals]

                features = [features[f] for f in model.roi_heads.in_features]
                box_features1 = model.roi_heads.box_pooler(features, proposal_boxes)
                box_features = model.roi_heads.box_head(box_features1)

                predictions = model.roi_heads.box_predictor(box_features)
                pred_instances, index = model.roi_heads.box_predictor.inference(predictions, proposals)

                topk = 10
                scores = pred_instances[0].get_fields()['scores']
                topk_index = index[0][:topk]

                thresh_mask = scores > thresh
                thresh_index = index[0][thresh_mask]

                # no score threshold in this pass: always take the top-k proposals
                index = [topk_index]
                if len(topk_index) > max_regions:
                    index = [topk_index[:max_regions]]

                if len(topk_index) < topk:
                    print("{} has less than 10 regions!!!".format(image_id))
                    raise RuntimeError("{} still has fewer than 10 regions".format(image_id))

                # features of the selected proposals (before / after the box head)
                proposal_box_features1 = box_features1[index].mean(dim=[2, 3])
                proposal_box_features = box_features[index]

                boxes = pred_instances[0].get_fields()['pred_boxes'].tensor[:len(index[0])]
                image_size = pred_instances[0].image_size

                assert boxes.shape[0] == proposal_box_features.shape[0]

                region_before.create_dataset('{}_features'.format(image_id), data=proposal_box_features1.cpu().numpy())
                region_before.create_dataset('{}_boxes'.format(image_id), data=boxes.cpu().numpy())
                region_before.create_dataset('{}_size'.format(image_id), data=np.array([image_size]))

                # region_after.create_dataset('{}_features'.format(image_id), data=proposal_box_features.cpu().numpy())
                # region_after.create_dataset('{}_boxes'.format(image_id), data=boxes.cpu().numpy())
                # region_after.create_dataset('{}_size'.format(image_id), data=np.array([image_size]))

region_before.close()
# region_after.close()
# grid7.close()
# original_grid.close()
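
This script writes only the first three of the README's five keys. As a rough sketch of the remaining merge step (all paths, the per-image .pth layout, and the grid reshape are assumptions, not part of this commit; the `%d_mask` key would come from align.ipynb):

```python
import h5py
import torch

region_path = 'region_before_X152.hdf5'  # written by the script above
out_path = 'coco_all_align.hdf5'         # five-key layout described in the README

with h5py.File(region_path, 'r') as regions, h5py.File(out_path, 'w') as out:
    image_ids = sorted({int(k.split('_')[0]) for k in regions.keys()})
    for image_id in image_ids:
        # copy the three region-level keys unchanged
        for suffix in ('features', 'boxes', 'size'):
            key = '%d_%s' % (image_id, suffix)
            out.create_dataset(key, data=regions[key][()])
        # hypothetical: grid features dumped per image by grid-feats-vqa;
        # reshape (C, H, W) -> (H*W, C) so each row is one grid position
        grid = torch.load('output/features/train2014/%d.pth' % image_id,
                          map_location='cpu').squeeze(0)
        out.create_dataset('%d_grids' % image_id, data=grid.flatten(1).t().numpy())
```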