# Data preparation
---

<span style="color:red"><b>Note: Do NOT run any of these cells/scripts on UIO IFI computers! (Explanation below)  </b></span> 

In this mandatory exercise you will implement an image captioning network. For training and validation you need images with corresponding descriptions. The dataset you will use is "Common Objects in Context" (COCO) 2017. You will also need pretrained weights from the VGG16 network.

If you are working on a UIO IFI computer, the data will be available to you on the project disk. The *path* is given in the assignment. The dataset is large (~18GB), so it is not feasible for every student to download their own copy on the UIO IFI computers. It also takes too long to produce the VGG16 features needed for the image captioning task. However, if you are working on your own computer, you will need to follow the steps in this notebook to be able to complete the exercise. <span style="color:orange">Downloading the dataset, generating the vocabulary and producing the VGG16 features will take a long time. How long depends on your internet connection and compute power, so it can be a good idea to run this notebook overnight. </span> 

This notebook will help you with:
- Downloading and unzipping training and validation data from the COCO 2017 dataset
- Generating a vocabulary dictionary holding information about the captions and the corresponding tokens
- Downloading and unzipping the VGG16 weights
- Producing and storing features from the second fully connected layer in the VGG16 network for all training and validation images





Links:
- [Step1: Download COCO dataset](#Task1)
- [Step2: Generate vocabulary](#Task2)
- [Step3: Download VGG16 weights](#Task3)
- [Step4: Produce VGG16 features](#Task4)


Software version:
- Python 3.6
- TensorFlow 1.4.0


---


<a id='Task1'></a>
### Step1: Download COCO dataset

After the download, the data can be found in the folder "data/coco". For example, the subfolder "train2017" contains the training images as jpg files.


**Note**: If the process failed at some point, you may need to go into the "data/coco" folder and delete the files which were not downloaded correctly before trying again.


In [2]:
from utils import coco

# Create the data class instance
myCocoDataClass = coco.CocoImagesDataClass()

# Set data directory
data_dir="data/coco/"
myCocoDataClass.set_data_dir(data_dir)


# Download coco dataset
myCocoDataClass.maybe_download_and_extract_coco()


Downloading http://images.cocodataset.org/zips/train2017.zip
Data has apparently already been downloaded and unpacked.
Downloading http://images.cocodataset.org/zips/val2017.zip
Data has apparently already been downloaded and unpacked.
Downloading http://images.cocodataset.org/annotations/annotations_trainval2017.zip
Data has apparently already been downloaded and unpacked.
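
If a download was interrupted, it can be hard to tell which archive is incomplete. The following is a minimal sketch, not part of `utils.coco`, that uses the standard library `zipfile` module to list archives that fail an integrity check. It assumes the downloaded `.zip` files are kept in "data/coco", which may not be the case for your setup; `find_corrupt_zips` is a hypothetical helper name.

```python
import os
import zipfile


def find_corrupt_zips(folder):
    """Return the names of .zip files in `folder` that fail an integrity check."""
    bad = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".zip"):
            continue
        path = os.path.join(folder, name)
        try:
            with zipfile.ZipFile(path) as zf:
                # testzip() returns the name of the first corrupt member,
                # or None if every member passes its CRC check.
                if zf.testzip() is not None:
                    bad.append(name)
        except zipfile.BadZipFile:
            # Truncated or otherwise unreadable archive.
            bad.append(name)
    return bad


# Assumption: the archives are stored directly in "data/coco".
if os.path.isdir("data/coco"):
    print(find_corrupt_zips("data/coco"))
```

Any file name printed here is a candidate for deletion before re-running the download cell. Note that checking the full training archive reads the whole file and can take several minutes.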


---

<a id='Task2'></a>
### Step2: Generate vocabulary ###


The vocabulary will be stored as a pickle file at "data/coco/vocabulary"

**Note**: If the process failed at some point, you may need to go into the "data/coco/vocabulary" folder and delete the file if it was not generated correctly before trying again.

In [3]:
# Load records
myCocoDataClass.load_records(trainSet=True)
myCocoDataClass.load_records(trainSet=False)

# Generate vocabulary
myCocoDataClass.generate_vocabulary()

The file "vocabulary.pickle" has already been produced.
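
To confirm that the vocabulary file was written and can be read back, you can load it and look at its top-level structure. This is a minimal sketch under assumptions: the file name "vocabulary.pickle" and the folder "data/coco/vocabulary" are taken from the text above, the internal layout of the object is not documented here, and `summarize_pickle` is a hypothetical helper. Only unpickle files that you produced yourself.

```python
import os
import pickle


def summarize_pickle(path):
    """Load a pickle file and return (type name, sorted keys or length)."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    type_name = type(obj).__name__
    if isinstance(obj, dict):
        # For a dictionary, report its top-level keys.
        return type_name, sorted(str(k) for k in obj)
    try:
        return type_name, len(obj)
    except TypeError:
        # The object has no length (for example a number).
        return type_name, None


# Assumption: the path is built from the folder and file name given above.
vocab_path = "data/coco/vocabulary/vocabulary.pickle"
if os.path.isfile(vocab_path):
    print(summarize_pickle(vocab_path))
```

If loading raises an error such as `EOFError` or `pickle.UnpicklingError`, the file is probably incomplete and should be deleted before regenerating the vocabulary.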


---

<a id='Task3'></a>
### Step3: Download VGG16 weights ###

The pretrained weights will be stored in the folder "model/VGG16" as a .ckpt file.

**Note**: If the process failed at some point, you may need to go into the "model/VGG16" folder and delete the file if it was not downloaded correctly before trying again.

In [4]:
# Download vgg16 weights
myCocoDataClass.maybe_download_and_extract_vgg16weights()

Downloading http://download.tensorflow.org/models/vgg_16_2016_08_28.tar.gz
Data has apparently already been downloaded and unpacked.


---

<a id='Task4'></a>
### Step4: Produce VGG16 features ###


The features will be stored in the folder "data/coco". For example, the subfolder "train2017_vgg16_fc7" contains the pickle files for the training examples.

**Note**: If the process failed at some point, you may need to go into the "data/coco" folder and delete "train2017_vgg16_fc7" and "val2017_vgg16_fc7" before trying again.


In [None]:
# Produce pickle files with VGG16 fc7 features
myCocoDataClass.produceVgg16Fc7()

INFO:tensorflow:Restoring parameters from data/coco/CNN/vgg_16.ckpt


Generate: Train pickle files:   0%|          | 0/1849 [00:00<?, ?it/s]
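
Since feature extraction runs for a long time, it can help to check how far it got after an interruption. The sketch below simply counts files in the image and feature folders named in this step. It is an illustration under assumptions, not part of `utils.coco`: the feature file extension (".pickle" or ".pkl") and the idea that the two counts should be comparable are guesses, and `count_files` is a hypothetical helper.

```python
import os


def count_files(folder, extensions):
    """Count files in `folder` whose names end with one of `extensions`."""
    if not os.path.isdir(folder):
        return 0
    exts = tuple(e.lower() for e in extensions)
    return sum(1 for name in os.listdir(folder) if name.lower().endswith(exts))


for split in ("train2017", "val2017"):
    image_dir = os.path.join("data/coco", split)
    feature_dir = os.path.join("data/coco", split + "_vgg16_fc7")
    n_images = count_files(image_dir, [".jpg"])
    # Assumption: feature files use a ".pickle" or ".pkl" extension.
    n_features = count_files(feature_dir, [".pickle", ".pkl"])
    print(split, "images:", n_images, "feature files:", n_features)
```

A feature count of zero, or one far below what you expect, suggests the run did not finish and the feature folders should be deleted before trying again.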