# Training set creation

## Getting started

This notebook walks through how we created the training set used to train and validate instances of sealnet, from generating a vector database to extracting patches from rasters. To recreate the training set, you will need QGIS (tested with version 2.8), Python 3.6, the WV03 raster image catalog, and a shapefile containing a database of seal occurrences.

## Table of contents
---
* [Exporting points](#exp)
* [Extracting patches](#extract)
* [Creating scene bank](#scene)
* [Generating synthetic seal images](#synth)
    

## Exporting points <a name="exp"></a>
---

Before we can extract patches from raster files, we need to export our database of seal points to a .csv file. This operation requires the MMQGIS plugin: open QGIS with the seal points shapefile and select the following option:

<img src="jupyter_notebook_images/export_geometry.png">

If you selected the correct shapefile as the input layer, you will be prompted to save a .csv file. Keep the default name and move this .csv file to the root directory of this repository.
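
Before moving on, it can help to confirm that the exported file is where the extraction step expects it and that it is not empty. The snippet below is a minimal sketch using only the standard library. The helper name `summarize_points_csv` is hypothetical, and the file name is the one used in the extraction cell of this notebook. The snippet makes no assumption about column names and only reports whatever header the export produced.

```python
import csv
from pathlib import Path


def summarize_points_csv(csv_path):
    """Return (column_names, row_count) for an exported points CSV."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; export it from QGIS first")
    with path.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])           # first row holds the column names
        row_count = sum(1 for _ in reader)  # remaining rows are points
    return header, row_count


if __name__ == "__main__":
    # File name taken from the extraction cell; adjust it if yours differs.
    csv_file = "seal_points_espg3031.csv"
    if Path(csv_file).is_file():
        columns, n_points = summarize_points_csv(csv_file)
        print(f"{n_points} points with columns: {columns}")
    else:
        print(f"{csv_file} is not in the current directory yet")
```

If the reported row count is zero or the header looks wrong, the export most likely used a different input layer.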

## Extracting patches<a name="extract"></a>
---

Once the output from the geometry export is in the repository root, we are ready to extract patches from the raster files. Run the following cell to extract patches and create the training sets. This process will take around 20 minutes and requires at least 16GB of RAM. 

In [None]:
# point to folder with raster images
raster_dir = '/home/bento/imagery'
# point to the .csv file exported from the seal points shapefile
shape_file = 'seal_points_espg3031.csv'
# specify training labels, separate each class by an '_', classes need to correspond to ones in the shapefile
labels = 'crabeater_crack_emperor_glacier_ice-sheet_marching-emperor_open-water_other_pack-ice_rock_weddell'
# specify labels of classes used for detection
det_classes = 'crabeater_weddell'

# create vanilla training set (spatial bands = 450, 450, 450)
%run create_trainingset.py --det_classes=$det_classes --rasters_dir=$raster_dir --scale_bands='450_450_450' --out_folder='training_set_vanilla' --labels=$labels --shape_file=$shape_file 

# create multiscale training set (spatial bands = 450, 1350, 4000)
%run create_trainingset.py --det_classes=$det_classes --rasters_dir=$raster_dir --scale_bands='450_1350_4000' --out_folder='training_set_multiscale_A' --labels=$labels --shape_file=$shape_file


Creating training_set_multiscale_A:

Checking input folder for invalid files:


  Untitled Document is not a valid scene.
  other.qpj is not a valid scene.
  seal_points.shp is not a valid scene.
  other.shx is not a valid scene.
  ae.qpj is not a valid scene.
  ae.shp is not a valid scene.
  seals_wv3.qgs~ is not a valid scene.
  other.prj is not a valid scene.
  SG_sealpoints.dbf is not a valid scene.
  ae.dbf is not a valid scene.
  filtered_scenes_WV03_PAN_ALL.csv is not a valid scene.
  ae.shx is not a valid scene.
  SG_sealpoints.shp is not a valid scene.
  other.dbf is not a valid scene.
  SG_sealpoints.prj is not a valid scene.
  seal_points.prj is not a valid scene.
  seal_points.qpj is not a valid scene.
  seal_points.shx is not a valid scene.
  seal_points.dbf is not a valid scene.
  SG_sealpoints.shx is not a valid scene.
  SG_sealpoints.qpj is not a valid scene.
  filtered_scenes_WV03_PAN_P001.csv is not a valid scene.
  other.shp is not a valid scene.
  ae.prj is not a v

  WV03_20141008171556_104001000281A100_14OCT08171556-P1BS-500258406090_01_P003_u08rf3031.tif is not an annotated scene.
  WV03_20151009060254_1040010012714300_15OCT09060254-P1BS-500652424060_01_P002_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20151005050119_1040010011172D00_15OCT05050119-P1BS-500638962090_01_P005_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20151030210819_104001001351DB00_15OCT30210819-P1BS-500658695070_01_P001_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20141020184444_10400100031B9000_14OCT20184444-P1BS-500258392140_01_P008_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20151029173949_1040010013AFB500_15OCT29173949-P1BS-500658659100_01_P001_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20151015203404_1040010012A43300_15OCT15203404-P1BS-500652486050_01_P001_u08rf3031.tif.aux.xml is not a valid scene.
  WV03_20141007120923_1040010002BF0000_14OCT07120923-P1BS-500268527140_01_P006_u08rf3031.tif is not an annotated scene.
  WV03_20151009060427_




  Processed 1 out of 34 rasters

  Processed 2 out of 34 rasters

  Processed 3 out of 34 rasters

  Processed 4 out of 34 rasters

  Processed 5 out of 34 rasters

  Processed 6 out of 34 rasters

  Processed 7 out of 34 rasters

  Processed 8 out of 34 rasters

  Processed 9 out of 34 rasters

  Processed 10 out of 34 rasters

  Processed 11 out of 34 rasters

  Processed 12 out of 34 rasters

  Processed 13 out of 34 rasters

  Processed 14 out of 34 rasters

  Processed 15 out of 34 rasters

  Processed 16 out of 34 rasters

  Processed 17 out of 34 rasters

  Processed 18 out of 34 rasters
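
To make the `--scale_bands` option more concrete, here is a rough sketch of one way a multiscale patch could be assembled: crop concentric windows of different sizes around the same point, resample each to a common size, and stack them as channels. This is an illustration under stated assumptions, not the code in `create_trainingset.py`. The function names are hypothetical, window sizes are treated as pixels, borders are zero-padded, and resampling is nearest-neighbor.

```python
import numpy as np


def parse_scale_bands(scale_bands):
    """Turn a string such as '450_1350_4000' into [450, 1350, 4000]."""
    return [int(size) for size in scale_bands.split("_")]


def crop_centered(image, row, col, size):
    """Crop a size x size window centered on (row, col), zero-padding past the edges."""
    half = size // 2
    padded = np.pad(image, half, mode="constant")
    # After padding by `half`, the window starts at (row, col) in padded coordinates.
    return padded[row:row + size, col:col + size]


def resize_nearest(patch, out_size):
    """Resample a square patch to out_size x out_size with nearest-neighbor sampling."""
    idx = np.arange(out_size) * patch.shape[0] // out_size
    return patch[np.ix_(idx, idx)]


def multiscale_patch(image, row, col, scale_bands, out_size=32):
    """Stack one resampled crop per scale band along the last axis."""
    bands = [
        resize_nearest(crop_centered(image, row, col, size), out_size)
        for size in scale_bands
    ]
    return np.stack(bands, axis=-1)


if __name__ == "__main__":
    # Toy single-band raster standing in for a panchromatic scene.
    raster = np.arange(100 * 100).reshape(100, 100)
    patch = multiscale_patch(raster, 50, 50, parse_scale_bands("10_30_90"), out_size=10)
    print(patch.shape)
```

With identical bands, as in the vanilla set, all three channels carry the same view. With increasing bands, each channel covers a wider context around the same point.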


## Creating scene bank<a name="scene"></a>
---

One step in evaluating sealnet instances is measuring how well the models identify scenes that contain seals. To measure that, we need a 'scene bank' that records which scenes count as positives (i.e., contain seals) and which count as negatives. To generate scene banks, run the cell below:

In [23]:
# creating seal scene banks
for label in ['crabeater', 'weddell', 'emperor', 'marching-emperor']:
    out_file = "{}_scene_bank.csv".format(label)
    %run create_scene_bank.py --positive_classes=$label --out_file=$out_file
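
For readers who want a feel for what a scene bank holds, the following is a minimal sketch of the idea: a scene is positive if at least one of its annotated points belongs to a positive class, and negative otherwise. It is not the implementation in `create_scene_bank.py`. The function names, the `(scene, label)` input format, and the output column names are assumptions made for illustration.

```python
import csv


def build_scene_bank(points, positive_classes):
    """Map each scene to True if it holds at least one point of a positive class.

    `points` is an iterable of (scene_name, label) pairs (hypothetical format).
    """
    bank = {}
    for scene, label in points:
        bank[scene] = bank.get(scene, False) or (label in positive_classes)
    return bank


def write_scene_bank(bank, out_file):
    """Write one row per scene with a 0/1 flag (column names are illustrative)."""
    with open(out_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["scene", "has_positive"])
        for scene in sorted(bank):
            writer.writerow([scene, int(bank[scene])])


if __name__ == "__main__":
    toy_points = [
        ("scene_a.tif", "weddell"),
        ("scene_a.tif", "rock"),
        ("scene_b.tif", "pack-ice"),
    ]
    print(build_scene_bank(toy_points, {"weddell"}))
```

Scene-level evaluation then reduces to comparing a model's per-scene prediction against this flag.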




## Synthesizing new seal haul out images<a name="synth"></a>
---

We can also create synthetic images by pasting cropped seals onto different backgrounds. This is an easy way to create test images that have a known answer and were not seen during training. We can do this in two steps: *1) generate seal and background image banks; 2) insert a random number of seals into background images.* It may also be a useful way to generate more training data, provided that the seal and background image banks are sufficiently rich and accurately represent the distribution of real seal images.
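
The second step can be sketched in a few lines of NumPy. This is a simplified illustration rather than the repository's synthesis code: the function name `synthesize_image` is hypothetical, images are assumed to be single-band arrays, seals are placed uniformly at random, and pasted crops may overlap.

```python
import numpy as np


def synthesize_image(background, seal_crops, max_seals, rng):
    """Paste a random number of seal crops onto a copy of the background.

    Returns the synthetic image and the (top, left) corner of every pasted crop,
    so the true seal count is known by construction.
    """
    image = background.copy()
    n_seals = int(rng.integers(1, max_seals + 1))
    positions = []
    for _ in range(n_seals):
        crop = seal_crops[int(rng.integers(len(seal_crops)))]
        height, width = crop.shape
        # Pick a corner such that the crop fits fully inside the image.
        top = int(rng.integers(0, image.shape[0] - height + 1))
        left = int(rng.integers(0, image.shape[1] - width + 1))
        image[top:top + height, left:left + width] = crop
        positions.append((top, left))
    return image, positions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    background = np.zeros((64, 64), dtype=np.uint8)
    seal_bank = [np.full((5, 5), 255, dtype=np.uint8)]  # toy stand-in for real crops
    synthetic, corners = synthesize_image(background, seal_bank, max_seals=3, rng=rng)
    print(len(corners), synthetic.shape)
```

Because the pasted positions are returned alongside the image, every synthetic image comes with its own ground truth.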

### Creating seal bank

The cell below, which runs `create_seal_bank.py`, loops through the seal point catalog and crops individual seals, saving single-seal images into a 'seal_bank' folder inside 'training_sets'. It also saves the mean and standard deviation of the distance to each seal's nearest neighbor, which are later used to decide how clumped individual seals can be in synthesized images.
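
As a rough illustration of the nearest-neighbor statistic mentioned above, the sketch below computes, for a set of point coordinates, the distance from each point to its closest neighbor and then the mean and standard deviation of those distances. The function name is hypothetical, and the brute-force pairwise approach is chosen for clarity. It is not necessarily how `create_seal_bank.py` computes these values.

```python
import numpy as np


def nearest_neighbor_stats(points):
    """Return (mean, std) of each point's distance to its nearest neighbor."""
    pts = np.asarray(points, dtype=float)
    if len(pts) < 2:
        raise ValueError("need at least two points")
    # Pairwise differences via broadcasting: result has shape (n, n, 2).
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # A point is not its own neighbor, so mask the zero diagonal.
    np.fill_diagonal(dists, np.inf)
    nearest = dists.min(axis=1)
    return nearest.mean(), nearest.std()


if __name__ == "__main__":
    toy_points = [(0.0, 0.0), (0.0, 1.0), (0.0, 3.0)]
    mean_dist, std_dist = nearest_neighbor_stats(toy_points)
    print(mean_dist, std_dist)
```

The pairwise matrix grows quadratically with the number of points, so a spatial index such as a k-d tree would be a better fit for very large catalogs.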

In [2]:
# generate seal bank for crabeaters and weddells
for spcs in ['weddell', 'crabeater']:
    %run create_seal_bank.py --training_dir='training_set_vanilla' \
                             --out_folder='./training_sets/training_set_synthesized/seal_bank' \
                             --label=$spcs
                    



Generating weddell seal bank:
  extracted 0 out of weddell 983 seal images
  extracted 100 out of weddell 983 seal images
  extracted 200 out of weddell 983 seal images
  extracted 300 out of weddell 983 seal images
  extracted 400 out of weddell 983 seal images
  extracted 500 out of weddell 983 seal images
  extracted 600 out of weddell 983 seal images
  extracted 700 out of weddell 983 seal images
  extracted 800 out of weddell 983 seal images
  extracted 900 out of weddell 983 seal images

Generating crabeater seal bank:
  extracted 0 out of crabeater 4238 seal images
  extracted 100 out of crabeater 4238 seal images
  extracted 200 out of crabeater 4238 seal images
  extracted 300 out of crabeater 4238 seal images
  extracted 400 out of crabeater 4238 seal images
  extracted 500 out of crabeater 4238 seal images
  extracted 600 out of crabeater 4238 seal images
  extracted 700 out of crabeater 4238 seal images
  extracted 800 out of crabeater 4238 seal images
  extracted 900 out 