# Data preparation
---

<span style="color:red"><b>Note: Do NOT run any of these cells/scripts on UIO IFI computers! (Explanation below)  </b></span> 

In this mandatory exercise you will implement an image captioning network. For training and validation data you will need images with corresponding descriptions. The dataset that you will use is "Common Objects in Context" (COCO) 2017. You will also need pretrained weights from the VGG16 network.

If you are working on a UIO IFI computer, the data will be available to you on the project disk. The *path* is given in the assignment. The dataset is large (~18GB), so it is not feasible for every student to download it on the UIO IFI computers. Producing the VGG16 features needed for the image captioning task also takes too long there. If you are working on your own computer, however, you will need to follow the steps in this notebook to complete the exercise. Downloading the dataset, generating the vocabulary and producing the VGG16 features will take a long time, depending on your internet connection and compute power.
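
Before starting the download on your own machine, it can help to confirm that the target disk has enough room. The following is a minimal sketch using only the standard library. The 18 GB default comes from the estimate above; the helper name `has_enough_space` is an assumption for illustration and is not part of the exercise code. Note that the extracted images and the VGG16 features will need additional space on top of the download itself.

```python
import shutil


def has_enough_space(path=".", required_gb=18.0):
    """Return True if the disk holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# Check the disk that holds the current working directory.
if has_enough_space("."):
    print("Enough free space for the COCO 2017 download.")
else:
    print("Less than 18 GB free; consider freeing space first.")
```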

This notebook will help you with:
- Downloading and unzipping training and validation data from the COCO 2017 dataset
- Generating a vocabulary dictionary holding information about the captions and the corresponding tokens.
- Downloading and unzipping the VGG16 weights
- Producing and storing features from the second fully connected layer in the VGG16 network for all training and validation images.





Links:
- [Step1: Download COCO dataset](#Task1)
- [Step2: Generate vocabulary](#Task2)
- [Step3: Download VGG16 weights and produce VGG16 features](#Task3)


Software version:
- Python 3.6
- PyTorch 1.0


---


<a id='Task1'></a>
### Step1: Download COCO dataset

The data can be found in the folder "data/coco". The subfolder "train2017", for example, contains the training images as jpg files.


**Note**: If the process fails at some point, you may need to go into the "data/coco" folder and delete the files that were not downloaded correctly before trying again.
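
One way to spot files that were not downloaded correctly is to test the integrity of any zip archives in the folder. The sketch below assumes that the downloaded `.zip` archives are kept in "data/coco", which may not match what `maybe_download_and_extract_coco` actually does. The helper name `find_broken_zips` is hypothetical.

```python
import zipfile
from pathlib import Path


def find_broken_zips(folder):
    """Return the .zip files in `folder` that are unreadable or fail their CRC check."""
    broken = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        try:
            with zipfile.ZipFile(zip_path) as archive:
                # testzip() returns the name of the first corrupt member, or None.
                if archive.testzip() is not None:
                    broken.append(zip_path)
        except zipfile.BadZipFile:
            # Truncated downloads usually end up here.
            broken.append(zip_path)
    return broken


# Assumed location of the archives; adjust if they are stored elsewhere.
for path in find_broken_zips("data/coco"):
    print("Incomplete or corrupt archive:", path)
```

Any archive that is reported here is a candidate for deletion before rerunning the download cell.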


In [None]:
import os
from utils_data_preparation.cocoDataset import maybe_download_and_extract_coco, DataLoaderWrapper
from utils_data_preparation.produceVGG16_fc7_features import produceVGG16_fc7_features

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU
device = "cuda"  # "cuda" or "cpu"
data_dir = "data/coco/"

# Download coco dataset
maybe_download_and_extract_coco(data_dir)



---

<a id='Task2'></a>
### Step2: Generate vocabulary ###


The vocabulary will be stored as a pickle file in "data/coco/vocabulary".
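
To make the idea of a vocabulary concrete, the sketch below maps each word to an integer token and encodes a caption as a list of tokens. It is an illustration only and not the implementation behind `generate_vocabulary()`. The function names, the special tokens and the `min_count` threshold are all assumptions.

```python
from collections import Counter

# Assumed special tokens; the exercise code may use different ones.
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]


def build_vocabulary(captions, min_count=1):
    """Map every word that occurs at least `min_count` times to an integer token."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    words = sorted(word for word, count in counts.items() if count >= min_count)
    return {word: token for token, word in enumerate(SPECIAL_TOKENS + words)}


def encode(caption, word_to_token):
    """Turn a caption into a token list wrapped in start and end tokens."""
    unknown = word_to_token["<unk>"]
    body = [word_to_token.get(word, unknown) for word in caption.lower().split()]
    return [word_to_token["<start>"]] + body + [word_to_token["<end>"]]


captions = ["A dog runs", "A cat sleeps"]
vocab = build_vocabulary(captions)
print(encode("a dog sleeps", vocab))
```

Words that are not in the vocabulary fall back to the `<unk>` token, which keeps the encoding defined for captions seen only at validation time.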

**Note**: If the process fails at some point, you may need to go into the "data/coco/vocabulary" folder and delete the file if it was not generated correctly before trying again.
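
A quick way to tell whether a pickle file was written completely is to try loading it. The sketch below is hedged: the helper name `unreadable_pickles` is hypothetical, and it assumes the vocabulary file holds plain Python objects. If the file stores instances of custom classes, those classes must be importable for the load to succeed. Only load pickle files that you created yourself.

```python
import pickle
from pathlib import Path


def unreadable_pickles(folder):
    """Return the files in `folder` that cannot be loaded with pickle."""
    bad = []
    for path in sorted(Path(folder).glob("*")):
        if not path.is_file():
            continue
        try:
            with open(path, "rb") as handle:
                pickle.load(handle)
        except Exception:
            # Truncated or corrupt files raise EOFError, UnpicklingError or similar.
            bad.append(path)
    return bad


for path in unreadable_pickles("data/coco/vocabulary"):
    print("Could not load:", path)
```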

In [None]:
# Generate dataloaders (train / val)
myDataLoader = DataLoaderWrapper(data_dir)

# Generate vocabulary
myDataLoader.generate_vocabulary()

---

<a id='Task3'></a>
### Step3: Download VGG16 weights and produce VGG16 features ###

The pretrained weights will be stored in the folder "data/coco/model/VGG16" as a .pth file.

**Note**: If the process fails at some point, you may need to go into the "data/coco/model/VGG16" folder and delete the file if it was not downloaded correctly before trying again.

The VGG16 features can be found in the folders "data/coco/Train2017_vgg16_fc7" and "data/coco/Val2017_vgg16_fc7".

**Note**: If the process fails at some point, you may need to go into the "data/coco" folder and delete "train2017_vgg16_fc7" and "val2017_vgg16_fc7" before trying again.
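
The cleanup described in the note can also be scripted. This is a minimal sketch, assuming the folder names given in the note; the helper name `remove_feature_dirs` is hypothetical. It deletes the feature folders permanently, so check the path before calling it.

```python
import shutil
from pathlib import Path


def remove_feature_dirs(data_dir="data/coco",
                        names=("train2017_vgg16_fc7", "val2017_vgg16_fc7")):
    """Delete the listed feature folders under `data_dir` and return those removed."""
    removed = []
    for name in names:
        target = Path(data_dir) / name
        if target.is_dir():
            shutil.rmtree(target)  # permanently deletes the folder and its contents
            removed.append(target)
    return removed
```

Calling `remove_feature_dirs()` with the defaults removes only the two feature folders and leaves the images, the vocabulary and the VGG16 weights untouched.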

In [None]:
# Produce pickle files with fc7 features and captions (words and tokens)
produceVGG16_fc7_features(myDataLoader, device)