# How to prepare PascalVOC Dataset to train an object detector

## Gathering Data
The PascalVOC Dataset used is composed of two different datasets from the years 2007 and 2012: VOC2007 and VOC2012. First of all you need to download the datasets from the official webpages:
1. VOC2007: from http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html#devkit select ''Download the training/validation data (450MB tar file)'' in the Development Kit section and ''Download the annotated test data (430MB tar file)'' in the Test Data section.
2. VOC2012: from http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html select ''Download the training/validation data (2GB tar file)'' in the Development Kit section.


The two trainval datasets, downloaded from the Development Kit section, are used for training, while the VOC2007 test set, taken from the Test Data section, serves as the test dataset.

ATTENTION: Both the VOC2007 trainval and the VOC2007 test data have to be extracted in the same location, i.e. download both archives and then merge their contents.
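As a minimal sketch of the merge step, the helper below extracts several tar archives into the same destination, so that archives sharing a top-level folder end up merged on disk. The function name `extract_archives` and the archive file names in the usage comment are hypothetical placeholders: replace them with the names of the files you actually downloaded.

```python
import tarfile
from pathlib import Path


def extract_archives(archives, destination):
    """Extract every tar archive in `archives` into the same `destination` folder."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    for archive in archives:
        with tarfile.open(archive) as tar:
            # Archives that share a top-level folder are merged on disk
            tar.extractall(path=destination)
    return destination


# Hypothetical file names, shown only as an example of usage:
# extract_archives(['voc2007_trainval.tar', 'voc2007_test.tar'], './')
```

Only extract archives obtained from the official webpages, since `extractall` writes whatever paths the archive contains.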

## Structure of the Dataset folder - Pascal VOC 
Once you have downloaded the aforementioned datasets, place them in the same folder. Inside the main folder 'VOCdevkit' there should be two subfolders, 'VOC2007' and 'VOC2012', each containing five subfolders:
1. Annotations: This folder contains the PascalVOC formatted annotation XML files, one per image, with the relevant information for that picture. Each XML file contains the path to the image in the 'path' element, the bounding box of each object stored in an 'object' element, and other features, as can be seen in the example below. Note that the bounding box is defined by two points, the upper left and bottom right corners.

<img src = "images/pascalvoc.PNG" style = "height:400px">

2. ImageSets: This folder contains three subfolders: 'Layout', 'Main' and 'Segmentation'. In particular, the subfolder 'Main' lists, for each class, the images that belong to the test, train or trainval subdivision.
3. JPEGImages: This folder contains all the images, which have to be in the JPG format.
4. and 5. SegmentationClass and SegmentationObject: folders containing the segmentation masks for some images and objects.
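To make the annotation format more concrete, here is a minimal sketch that reads the objects of one annotation with the standard library. The XML string is a hand-written sample in the PascalVOC layout (it is not taken from the dataset), and `parse_annotation` is a hypothetical helper, not necessarily how create_json.py does it.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the PascalVOC layout (not an actual dataset file)
SAMPLE_XML = """
<annotation>
    <filename>000001.jpg</filename>
    <size><width>353</width><height>500</height><depth>3</depth></size>
    <object>
        <name>dog</name>
        <difficult>0</difficult>
        <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
    </object>
    <object>
        <name>person</name>
        <difficult>1</difficult>
        <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
    </object>
</annotation>
"""


def parse_annotation(xml_text):
    """Return one dict per object: class name, difficult flag and box corners."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter('object'):
        bndbox = obj.find('bndbox')
        # Box stored as [xmin, ymin, xmax, ymax]: upper left and bottom right corners
        box = [int(bndbox.find(tag).text) for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        objects.append({
            'name': obj.find('name').text.strip(),
            'difficult': int(obj.find('difficult').text),
            'box': box,
        })
    return objects


print(parse_annotation(SAMPLE_XML))
```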


## Data pipeline
Once you have the correct structure of the dataset, you should divide data into training and test splits. Data should also be saved in JSON files in order to be used inside the PyTorch Dataset class that will be created later for this purpose.

### Parse raw data - Creation of JSON files & splitting of the dataset
Run the create_json.py script found in the dataset folder (see the script itself for more details). When running it, you need to provide the paths to the VOC2007 and VOC2012 folders, as well as to the desired output folder where the JSON files should be saved.

This script parses the data downloaded and returns as output the following files:
1. A JSON file for each split (train or test) with a list of the absolute filepaths for each image in that split.
2. A JSON file for each split (train or test) with a list of dictionaries containing ground truth objects, i.e. bounding boxes in absolute boundary coordinates, their encoded labels, and perceived detection difficulties for each image in that split. Therefore, the i-th dictionary in this list will contain the objects present in the i-th image of the split.
3. A JSON file which contains the label_map, the label-to-index dictionary with which the labels are encoded in the previous JSON file. This dictionary is also available in the script (create_json.py) and directly importable.
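The layout of these files can be illustrated with a small sketch. The 20 class names are the PascalVOC ones. The index assignment (0 for the background, 1 to 20 for the classes in alphabetical order) and the dictionary keys `boxes`, `labels` and `difficulties` are assumptions made for illustration, so check create_json.py for the actual encoding.

```python
import json

# The 20 PascalVOC object classes
VOC_LABELS = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
              'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
              'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')

# Assumption: indices start at 1 and 0 is reserved for the background
label_map = {label: index + 1 for index, label in enumerate(VOC_LABELS)}
label_map['background'] = 0

# Hypothetical entry for one image containing a dog and a person
objects = {
    'boxes': [[48, 240, 195, 371], [8, 12, 352, 498]],
    'labels': [label_map['dog'], label_map['person']],
    'difficulties': [0, 1],
}

# The i-th dictionary of the list describes the i-th image of the split
print(json.dumps([objects]))
```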


In [1]:
# %run ../dataset/create_json.py path_to_VOC2007dir path_to_VOC2012dir path_to_outputfolder
%run ../dataset/create_json.py ../VOCdevkit/VOC2007/ ../VOCdevkit/VOC2012/ ./

/scratch/lmeneghe/Smithers/smithers/ml/VOCdevkit/VOC2007
/scratch/lmeneghe/Smithers/smithers/ml/VOCdevkit/VOC2007
/scratch/lmeneghe/Smithers/smithers/ml/VOCdevkit/VOC2012

There are 16551 training images containing a total of 49653          objects. Files have been saved to /scratch/lmeneghe/Smithers/smithers/ml/tutorials.

There are 4952 test images containing a total of 14856         objects. Files have been saved to /scratch/lmeneghe/Smithers/smithers/ml/tutorials.


## Defining a PascalVOCDataset class

In order to use the constructed dataset properly, we need to define a subclass of PyTorch Dataset, called PascalVOCDataset. For more details about the implementation see pascalvoc_dataset.py in the dataset folder.


The PascalVOCDataset class has been defined to load your training and test datasets from the JSON files created above. It needs a __len__ method, which returns the size of the dataset, and a __getitem__ method, which returns the i-th image, the bounding boxes of the objects in this image, and the labels of these objects, using the JSON files we saved earlier.

You will notice that it also returns the perceived detection difficulties of each of these objects, but these are not actually used in training the model. They are required only in the Evaluation stage for computing the Mean Average Precision (mAP) metric. We also have the option of filtering out difficult objects entirely from our data to speed up training at the cost of some accuracy.
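The interface described above can be sketched in plain Python. This is not the implementation in pascalvoc_dataset.py: the real class subclasses the PyTorch Dataset and returns the loaded image, while the hypothetical `MiniVOCDataset` below only returns the image path, and the keys `boxes`, `labels` and `difficulties` are assumed names.

```python
class MiniVOCDataset:
    """Map-style dataset sketch: returns annotations, not decoded images."""

    def __init__(self, images, objects, keep_difficult=True):
        # images: list of file paths; objects: list of dicts, one per image
        assert len(images) == len(objects)
        self.images = images
        self.objects = objects
        self.keep_difficult = keep_difficult

    def __len__(self):
        # Size of the dataset
        return len(self.images)

    def __getitem__(self, i):
        obj = self.objects[i]
        boxes = obj['boxes']
        labels = obj['labels']
        difficulties = obj['difficulties']
        if not self.keep_difficult:
            # Keep only the objects whose difficult flag is 0
            keep = [k for k, flag in enumerate(difficulties) if flag == 0]
            boxes = [boxes[k] for k in keep]
            labels = [labels[k] for k in keep]
            difficulties = [difficulties[k] for k in keep]
        return self.images[i], boxes, labels, difficulties


# Toy data with one image and two objects, the second marked as difficult
toy_images = ['000001.jpg']
toy_objects = [{'boxes': [[48, 240, 195, 371], [8, 12, 352, 498]],
                'labels': [12, 15],
                'difficulties': [0, 1]}]
dataset = MiniVOCDataset(toy_images, toy_objects, keep_difficult=False)
print(len(dataset), dataset[0])
```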

In [3]:
import sys
sys.path.insert(0, '/scratch/lmeneghe/Smithers/')
from smithers.ml.dataset.pascalvoc_dataset import PascalVOCDataset

keep_difficult = True

# data_folder corresponds to the output folder defined before, where the JSON files have been saved
data_folder = './'
#data_folder = '/u/s/szanin/Smithers/smithers/ml/tutorials/'
# Load train data
train_dataset = PascalVOCDataset(data_folder,
                                 split='train',
                                 keep_difficult=keep_difficult)

# Load test data
test_dataset = PascalVOCDataset(data_folder,
                                split='test',
                                keep_difficult=keep_difficult)

## Extract smaller datasets
If you want to test your model against a smaller dataset than PascalVOC, you can extract a set of images from the original PascalVOC using ***sample_dataset.py*** in the dataset folder.

You can thus extract a dataset composed of N images divided in M classes, where N and M are less than the total number of images and classes composing the dataset under consideration. For example, we can create a dataset composed of 300 images of cats and dogs.
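The idea behind the sampling can be sketched as follows. This is not the code of sample_dataset.py: `sample_images` is a hypothetical helper, and the mapping from image names to class names is toy data.

```python
import random


def sample_images(image_labels, classes, n_images, seed=0):
    """Pick `n_images` image names that contain at least one of `classes`."""
    wanted = set(classes)
    # Sort the candidates so that the result only depends on the seed
    candidates = sorted(name for name, labels in image_labels.items()
                        if wanted & set(labels))
    if n_images > len(candidates):
        raise ValueError('not enough images for the requested classes')
    return random.Random(seed).sample(candidates, n_images)


# Toy mapping from image name to the classes present in that image
toy_labels = {
    '000001': ['dog', 'person'],
    '000002': ['car'],
    '000003': ['cat'],
    '000004': ['cat', 'dog'],
}
print(sample_images(toy_labels, ['cat', 'dog'], 2))
```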

In [4]:
%run ../dataset/sample_dataset.py

We now have to split the subdataset into a train dataset (e.g. 80% of the total) and a test dataset (e.g. the remaining 20%). To do so we use the same procedure found in the splitting section of the tutorial ***customdata_objdet***.

We first create the directories and files needed.

Below, after the first ``cd`` command, insert the path to the folder created by the previous cell. In my case this is
``/u/s/szanin/Smithers/smithers/ml/tutorials/VOC_dog_cat/``.

In [6]:
%%bash
(   
    cd VOC_dog_cat/;
    touch datafile.txt;
    mkdir JSONfiles
    mkdir ImageSets;
    cd ImageSets;
    mkdir Main;
    cd Main;
    touch trainval.txt;
    touch test.txt
)

Now we populate the datafile.txt file with the names of the images we sampled.

Beware that:
- in the ``datafiletxt_path`` variable you need to insert the string containing the path to your datafile.txt file we have just created;
- in the ``jpeg_path`` variable you need to insert the string containing the path to your JPEGImages folder of the reduced dataset.

In [15]:
import os

datafiletxt_path = 'VOC_dog_cat/datafile.txt'
jpeg_path = 'VOC_dog_cat/JPEGImages/'

# If you are using Python < 3.9 you need this function to remove the 
# suffix jpg, otherwise you can uncomment the lines using the
# removesuffix function
def remove_suffix(input_string, suffix):
    if suffix and input_string.endswith(suffix):
        return input_string[:-len(suffix)]
    return input_string

with open(datafiletxt_path, 'w') as datafile:
    # One image name per line, without the .jpg suffix
    names = [remove_suffix(element, '.jpg') for element in os.listdir(jpeg_path)]
    #names = [element.removesuffix('.jpg') for element in os.listdir(jpeg_path)]
    # No newline after the last name, so the file does not end with an empty entry
    datafile.write('\n'.join(names))

The following cell will construct the .json files relative to the smaller dataset sampled.

Beware that in the variables ``train_file`` and ``test_file`` you need to insert your own paths as follows:
- in ``train_file`` insert the string containing the path of the trainval.txt file we created above;
- in ``test_file`` insert the string containing the path of the test.txt file we created above.

In [17]:
import numpy
from sklearn.model_selection import train_test_split

train_file = 'VOC_dog_cat/ImageSets/Main/trainval.txt'
test_file = 'VOC_dog_cat/ImageSets/Main/test.txt'

with open(datafiletxt_path, 'r') as f:
    # datafile.txt holds one image name per line
    data = numpy.array(f.read().split('\n'))

# Random split of the image names: 80% train, 20% test
train, test = train_test_split(data, test_size=0.2)

# Write each split to its own file, one image name per line
for out_path, names in ((train_file, train), (test_file, test)):
    with open(out_path, 'w') as out_file:
        for name_img in names:
            out_file.write(name_img + '\n')

We will now create the .json files for this dataset and save them in the folder JSONfiles, inside VOC_dog_cat.

In [18]:
%run ../dataset/create_json.py ./VOC_dog_cat/ None ./VOC_dog_cat/JSONfiles

/scratch/lmeneghe/Smithers/smithers/ml/tutorials/VOC_dog_cat
/scratch/lmeneghe/Smithers/smithers/ml/tutorials/VOC_dog_cat

There are 240 training images containing a total of 720          objects. Files have been saved to /scratch/lmeneghe/Smithers/smithers/ml/tutorials/VOC_dog_cat/JSONfiles.

There are 60 test images containing a total of 180         objects. Files have been saved to /scratch/lmeneghe/Smithers/smithers/ml/tutorials/VOC_dog_cat/JSONfiles.
