# Training set creation

## Getting started

This notebook walks you through how we created the training set used to train and validate instances of sealnet, from generating a vector database to extracting patches from rasters. To recreate the training set, you will need QGIS (tested with 2.8), Python 3.6, the WV03 raster image catalog, and a shapefile with a database of seal occurrences.

## Table of contents
---
* [Exporting points](#exp)
* [Extracting patches](#extract)
* [Creating scene bank](#scene)
* [Generating synthetic seal images](#synth)
    

## Exporting points <a name="exp"></a>
---

Before we can extract patches from raster files, we need to export our database of seal points to a CSV file. This operation requires the MMQGIS plugin and can be done by opening QGIS with the seal points shapefile and selecting the following option:

<img src="jupyter_notebook_images/export_geometry.png">

If you selected the correct shapefile as the input layer, you will be prompted to save a .csv file. Keep the default name (the extraction cell below reads `temp-nodes.csv`) and move this .csv file to the root directory of this repository.
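
Before running the extraction, it can help to confirm that the exported file is where the next cell expects it and is not empty. The snippet below is a minimal sketch using only the standard library; `summarize_points_csv` is a hypothetical helper, not part of this repository, and it makes no assumption about which columns MMQGIS writes.

```python
import csv
from pathlib import Path


def summarize_points_csv(csv_path):
    """Return (column_names, row_count) for an exported points CSV.

    Hypothetical helper: it only reads the header and counts data rows,
    so it does not depend on any particular column layout.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            "{} not found; export it from QGIS first".format(path)
        )
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)  # consumes all data rows
        columns = reader.fieldnames or []   # header parsed on first read
    return columns, row_count
```

Calling `summarize_points_csv('temp-nodes.csv')` from the repository root should return the header names and a non-zero row count if the export worked.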

## Extracting patches<a name="extract"></a>
---

Once the output from the geometry export is in the repository root, we are ready to extract patches from the raster files. Run the following cell to extract patches and create the training sets. This process takes around 20 minutes and requires at least 16 GB of RAM.

In [3]:
# point to folder with raster images
raster_dir = '/home/bento/training_set_scenes'

# training set vanilla

# point to the CSV file exported from the seal points shapefile
shape_file = 'temp-nodes.csv'
# specify training labels, separate each class by an '_', classes need to correspond to ones in the shapefile
labels = 'crabeater_crack_emperor_glacier_ice-sheet_marching-emperor_open-water_pack-ice_rock_shadow_weddell'
# specify labels of classes used for detection
det_classes = 'crabeater_weddell'

%run create_trainingset.py --det_classes=$det_classes --rasters_dir=$raster_dir --scale_bands='450' --out_folder='training_set_vanilla' --labels=$labels --shape_file=$shape_file --rgb='0'



Creating training_set_vanilla_grayscale:

Checking input folder for invalid files:


  WV03_20141120214545_1040010005B62F00_14NOV20214545-P1BS-500258392060_01_P001_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20141107053013_10400100046B2800_14NOV07053013-P1BS-500268574010_01_P006_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20160225140323_10400100196BE200_16FEB25140323-P1BS-500638709010_01_P009_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20160210052144_10400100184CC500_16FEB10052144-P1BS-500687302080_01_P001_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20160227062822_10400100181F9B00_16FEB27062822-P1BS-500638715080_01_P001_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20160227062810_10400100191DC900_16FEB27062810-P1BS-500638658040_01_P002_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20151105045105_104001001323F200_15NOV05045105-P1BS-500658675020_01_P002_u08rf3031.tif.aux.xml is not a valid scene.
  beagle2.tif is not an annotated scene.
  WV03_20170

## Creating scene bank<a name="scene"></a>
---

One step in evaluating sealnet instances is measuring how well the models identify scenes that contain seals. To measure that, we need to create a 'scene bank', which records which scenes count as positives (i.e., contain seals) and which count as negatives. To generate scene banks, run the cell below:

In [None]:
# creating seal scene banks
for label in ['crabeater', 'weddell', 'emperor', 'marching-emperor']:
    out_file = "{}_scene_bank.csv".format(label)
    %run create_scene_bank.py --positive_classes=$label --out_file=$out_file
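
To illustrate the idea behind a scene bank, here is a minimal sketch of the positive/negative split. It is not the code in `create_scene_bank.py`: the function names, the input mapping from scene name to annotated labels, and the output column names (`scene`, `label`) are all assumptions made for illustration.

```python
import csv


def build_scene_bank(scene_labels, positive_classes):
    """Mark each scene positive if it contains any positive class.

    scene_labels: dict mapping scene name -> iterable of class labels
                  annotated in that scene (hypothetical input format).
    positive_classes: labels that make a scene count as a positive.
    """
    positive_classes = set(positive_classes)
    rows = []
    for scene, labels in sorted(scene_labels.items()):
        is_positive = bool(positive_classes & set(labels))
        rows.append(
            {"scene": scene, "label": "positive" if is_positive else "negative"}
        )
    return rows


def write_scene_bank(rows, out_file):
    """Write the rows to a CSV file with assumed column names."""
    with open(out_file, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["scene", "label"])
        writer.writeheader()
        writer.writerows(rows)
```

With this layout, the same scene can be a positive in one bank (for example `weddell`) and a negative in another (for example `crabeater`), which is why the cell above builds one bank per label.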




## Synthesizing new seal haul-out images<a name="synth"></a>
---

We can also create synthetic images by pasting cropped seals onto different backgrounds. This is an easy way to create test images that have a known answer and were not seen during training. We do this in two steps: *1) generate seal and background image banks; 2) insert a random number of seals into background images.* It may also be a useful way to generate more training data, provided that the seal and background image banks are sufficiently rich and accurately represent the distribution of real seal images.
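
Step 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: images are plain 2-D lists of grayscale values, every patch fits inside the background, and seals are placed uniformly at random with no blending and no clumping model. The names `paste_patch` and `synthesize` are hypothetical.

```python
import random


def paste_patch(background, patch, top, left):
    """Copy a 2-D patch into a 2-D background in place (no blending)."""
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            background[top + r][left + c] = value


def synthesize(background, seal_patches, max_seals, rng):
    """Paste a random number of seal patches at random positions.

    Returns the new image and the number of seals pasted. Patches may
    overlap, so the count is the number inserted, not necessarily the
    number left fully visible.
    """
    image = [row[:] for row in background]  # leave the input untouched
    height, width = len(image), len(image[0])
    n_seals = rng.randint(0, max_seals)  # inclusive on both ends
    for _ in range(n_seals):
        patch = rng.choice(seal_patches)
        p_h, p_w = len(patch), len(patch[0])
        top = rng.randint(0, height - p_h)
        left = rng.randint(0, width - p_w)
        paste_patch(image, patch, top, left)
    return image, n_seals
```

Passing a seeded `random.Random` instance as `rng` makes the synthetic test set reproducible.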

### Creating seal bank

The following cell loops through the seal point catalog and crops individual seals, saving single-seal images into a 'seal_bank' folder inside 'training_sets'. It also saves the mean and standard deviation of the distance to the nearest neighbor, which will later be used to decide how clumped individual seals can be in synthesized images.

In [None]:
# generate seal bank for crabeaters and weddells
for spcs in ['weddell', 'crabeater']:
    %run create_seal_bank.py --training_dir='training_set_vanilla' \
                             --out_folder='./training_sets/training_set_synthesized/seal_bank' \
                             --label=$spcs
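
The nearest-neighbor statistic mentioned above can be illustrated with a short sketch. It assumes seal locations are available as `(x, y)` tuples in a projected coordinate system and uses the population standard deviation. Whether `create_seal_bank.py` computes it per scene or over the whole catalog, and which standard deviation it uses, is not stated here, so treat those choices as assumptions. `nearest_neighbor_stats` is a hypothetical name.

```python
import math
import statistics


def nearest_neighbor_stats(points):
    """Return (mean, std) of each point's distance to its nearest neighbor.

    points: list of (x, y) tuples in a common projected coordinate system.
    Uses a brute-force O(n^2) search, which is fine for small catalogs.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    distances = []
    for i, (x1, y1) in enumerate(points):
        nearest = min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points)
            if i != j
        )
        distances.append(nearest)
    # Population standard deviation; an assumption, see the note above.
    return statistics.mean(distances), statistics.pstdev(distances)
```

A small mean with a small standard deviation indicates tightly clumped seals, so a synthesizer could sample inter-seal spacing from a distribution with these two parameters.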
                    
