# Datasets pre-processing


This notebook pre-processes all datasets into a unified format. Follow the instructions below for the dataset you want to use.

In [None]:
# Make sure we're in the correct directory
import os
cwd = os.getcwd()
if cwd.endswith('data'):
    os.chdir('../')
cwd = os.getcwd()
print(cwd) # This should be the root of the repository e.g. the `MultiGrounding` folder

# Visual Genome

## Download and prepare Visual Genome 

If you do not have the VisualGenome dataset yet:
1. Download the v1.2 images and region descriptions from [VisualGenome](https://visualgenome.org/api/v0/api_home.html)
2. Put all images in the `VG_Images` folder

Otherwise, you can remove the directory `VG_Images` and create a symbolic link from your VisualGenome image directory (e.g. `/datasets/VisualGenome/`) to the `VG_Images` folder as follows:
`ln -s /datasets/VisualGenome/ VG_Images`
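If you prefer to stay inside the notebook, the same link can be created from Python. This is a minimal sketch using only the standard library; the helper name `link_image_dir` is hypothetical and not part of this repository.

```python
import tempfile
from pathlib import Path


def link_image_dir(source_dir, link_path):
    """Create `link_path` as a symlink to `source_dir` (like `ln -s source_dir link_path`)."""
    source = Path(source_dir).resolve()
    link = Path(link_path)
    if link.is_symlink():
        link.unlink()  # replace a stale link from an earlier run
    elif link.exists():
        # A real directory is in the way; it has to be removed first, as described above
        raise FileExistsError(f"{link} exists and is not a symlink; remove it first")
    link.symlink_to(source, target_is_directory=True)
    return link


# Demo in a throwaway directory. In practice you would call something like
# link_image_dir('/datasets/VisualGenome/', 'VG_Images') from the folder
# that should contain VG_Images.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "VisualGenome"
    source.mkdir()
    link = link_image_dir(source, Path(tmp) / "VG_Images")
    print(link.is_symlink())
```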

Then:
1. Download and unzip [region_descriptions.json](https://visualgenome.org/static/data/dataset/region_descriptions.json.zip) and put it in the `VG_Annotations` folder
2. Make sure `imgs_data.pickle` (already included) is in `VG_Annotations`
3. Make sure `data_splits.pickle` (already included) is in `VG_Splits`
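Before running the pre-processing cell, it can help to check that the files listed above are in place. The sketch below is a hypothetical helper, not part of `data.data_utils`. It assumes the three folders live under `./data/visual_genome/` (the `raw_path` used in the pre-processing cell) and that `region_descriptions.json` has been unzipped.

```python
from pathlib import Path

# Relative paths taken from the steps above. Their location under
# ./data/visual_genome/ is an assumption based on the raw_path argument.
VG_EXPECTED = [
    "VG_Images",
    "VG_Annotations/region_descriptions.json",
    "VG_Annotations/imgs_data.pickle",
    "VG_Splits/data_splits.pickle",
]


def missing_paths(root, expected):
    """Return the entries of `expected` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


missing = missing_paths("./data/visual_genome/", VG_EXPECTED)
print("Missing:", missing if missing else "nothing")
```

The same helper can be reused for the other datasets by passing a different root and list of expected paths.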
 
NB: For a fair comparison with the state of the art (SOTA):
1. We make sure the train split does not overlap with any test split of the other datasets
2. We take the COCO val set as the test split of Visual Genome and the rest of Visual Genome as the train split
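The no-overlap condition can be illustrated with plain sets of image ids. This is only a sketch on toy ids: the provided `data_splits.pickle` already encodes the actual splits, its internal format is not shown here, and the function names are hypothetical.

```python
def split_overlap(train_ids, test_ids):
    """Return the image ids that appear in both splits (empty set if disjoint)."""
    return set(train_ids) & set(test_ids)


def drop_overlap(train_ids, test_ids):
    """Return the train ids with every id that also occurs in `test_ids` removed."""
    blocked = set(test_ids)
    return [image_id for image_id in train_ids if image_id not in blocked]


# Toy ids for illustration only
train = [1, 2, 3, 4]
other_test = [3, 4, 5]

clean_train = drop_overlap(train, other_test)
assert not split_overlap(clean_train, other_test)
print(clean_train)  # [1, 2]
```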

## Pre-process Visual Genome 

In [None]:
# Pre-processing
from data.data_utils import Data
data_saver = Data()
data_saver(
    raw_path='./data/visual_genome/',
    save_path='./data/visual_genome_pp/',
    name='visual_genome',
    store_lmdb=True,
)

# Flickr30K Entities

## Download and prepare Flickr30K Entities

1. Clone the master branch from [https://github.com/BryanPlummer/flickr30k_entities](https://github.com/BryanPlummer/flickr30k_entities)
2. Copy `annotations.zip` into the `Flickr30k_Entities` folder and unzip it; this should create and fill the subfolders `Sentences` and `Annotations`
3. Put `train.txt`, `test.txt`, and `val.txt` in the `Flickr30k_Splits` folder
4. Download flickr30k-images from [http://hockenmaier.cs.illinois.edu/DenotationGraph/](http://hockenmaier.cs.illinois.edu/DenotationGraph/) and extract the archive
5. Put all images in the `Flickr30k_Images` folder (for example, by renaming the extracted folder)
6. Put the folders `Flickr30k_Entities`, `Flickr30k_Images`, and `Flickr30k_Splits` under `./data/flickr30k/`
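Step 2 can also be done from Python with the standard `zipfile` module. The sketch below uses a hypothetical helper `unzip_into` and assumes `annotations.zip` has already been copied to `./data/flickr30k/Flickr30k_Entities/`. According to step 2, the returned list should then contain `Annotations` and `Sentences`.

```python
import zipfile
from pathlib import Path


def unzip_into(zip_path, target_dir):
    """Extract `zip_path` into `target_dir` and return the names of its subfolders."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir() if p.is_dir())


# Assumed location of the archive (see steps 2 and 6); nothing happens if it is absent
zip_path = Path("./data/flickr30k/Flickr30k_Entities/annotations.zip")
if zip_path.exists():
    print(unzip_into(zip_path, zip_path.parent))
```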

In [None]:
# Pre-processing
from data.data_utils import Data
data_saver = Data()
data_saver(
    raw_path='./data/flickr30k/',
    save_path='./data/flickr30k_pp/',
    name='flickr30k_entities',
    store_lmdb=True,
)

# ReferIt

## Download and prepare ReferIt

1. Download the `refclef` split from [http://bvisionweb1.cs.unc.edu/licheng/referit/data/refclef.zip](http://bvisionweb1.cs.unc.edu/licheng/referit/data/refclef.zip)
2. Download the cleaned `refclef` images from [http://bvisionweb1.cs.unc.edu/licheng/referit/data/images/saiapr_tc-12.zip](http://bvisionweb1.cs.unc.edu/licheng/referit/data/images/saiapr_tc-12.zip)
3. Unzip `refclef.zip` into `ReferIt_Splits` so that the files sit directly in `ReferIt_Splits`, not in a `refclef` subfolder
4. Unzip `saiapr_tc-12.zip` into `ReferIt_Images` so that the folders `00` to `40` sit directly in `ReferIt_Images`, not in a `saiapr_tc-12` subfolder
5. Extract `RefClef_Captions.tgz` (already included); this should create and fill the folder `ReferClef_Captions`
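Steps 3 and 4 ask for the archive contents without their top-level folder. One way to do this is to strip that folder name while extracting. The following is a minimal sketch with a hypothetical helper `extract_flat`; it assumes each archive wraps its contents in a single top-level folder (`refclef` or `saiapr_tc-12`), as the steps above imply.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath


def extract_flat(zip_path, target_dir, strip_prefix):
    """Extract a zip so that members under `strip_prefix/` land directly in `target_dir`."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            parts = PurePosixPath(info.filename).parts
            if parts and parts[0] == strip_prefix:
                parts = parts[1:]  # drop the wrapping folder
            if not parts or info.is_dir() or ".." in parts:
                continue  # skip directory entries and unsafe paths
            dest = target.joinpath(*parts)
            dest.parent.mkdir(parents=True, exist_ok=True)
            with archive.open(info) as src, open(dest, "wb") as dst:
                shutil.copyfileobj(src, dst)
    return target


# Example calls (archive locations are assumptions):
# extract_flat("refclef.zip", "ReferIt_Splits", "refclef")
# extract_flat("saiapr_tc-12.zip", "ReferIt_Images", "saiapr_tc-12")
```

Members that are not under the given prefix are extracted unchanged, so the helper also works if an archive turns out to have no wrapping folder.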

Note: You can choose to download and process the RefCOCO splits instead, but we used RefClef with the "UNC" split (an established split in this area) and have already included the proper `image_id`s in the repo.

In [None]:
# Pre-processing
from data.data_utils import Data
data_saver = Data()
data_saver(
    raw_path='./data/referit/',
    save_path='./data/referit_pp/',
    name='referit',
    store_lmdb=True,
)

# MS-COCO

## Download and prepare MS-COCO

1. Clone and build the Python API of the COCO dataset from [https://github.com/cocodataset/cocoapi/tree/master/PythonAPI](https://github.com/cocodataset/cocoapi/tree/master/PythonAPI) if it is not already installed
2. Download the COCO train/val images and annotations ([train](http://images.cocodataset.org/annotations/annotations_trainval2014.zip) and [test](http://images.cocodataset.org/annotations/image_info_test2014.zip)) from [http://cocodataset.org/#download](http://cocodataset.org/#download)
3. Unzip all splits and put all images in one folder named `COCO_Images`
4. Unzip the annotation archives and put all the files they contain in `COCO_Annotations`
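Step 3 merges the images of several splits into one folder. A minimal sketch of that step is shown below; `gather_images` is a hypothetical helper, and the split folder names in the example call (`train2014`, `val2014`) are assumptions about how the image archives unpack.

```python
import shutil
from pathlib import Path


def gather_images(split_dirs, target_dir, suffixes=(".jpg", ".jpeg", ".png")):
    """Move all image files from `split_dirs` into `target_dir`; return how many were moved."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = 0
    for split_dir in split_dirs:
        # sorted() materializes the listing before any file is moved
        for path in sorted(Path(split_dir).iterdir()):
            if not path.is_file() or path.suffix.lower() not in suffixes:
                continue
            dest = target / path.name
            if dest.exists():
                raise FileExistsError(f"{dest} already exists; refusing to overwrite")
            shutil.move(str(path), str(dest))
            moved += 1
    return moved


# Example call (folder names are assumptions):
# gather_images(["train2014", "val2014"], "./data/coco/COCO_Images")
```

The helper refuses to overwrite existing files, so a name clash between splits surfaces as an error instead of silently dropping an image.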


Note: We used the 2014 version in our evaluations.

Note: Using `gsutil` speeds up downloading the images (instructions are available at [http://cocodataset.org/#download](http://cocodataset.org/#download)).

In [None]:
# Pre-processing
from data.data_utils import Data
data_saver = Data()
data_saver(
    raw_path='./data/coco/',
    save_path='./data/coco_pp/',
    name='coco',
    store_lmdb=True,
    version='2014',
)